Best practices: it’s a term we hear thrown around a lot. But like that New Year’s diet, it too often ends up more talk than action.
Operator error, such as typing the wrong command, is always a risk. Logging into the wrong server to drop a database, or typing the dump command so that you load data into the database instead of dumping it out: these are risks that operations folks face every day.
Accountability is important, so be sure all of your systems folks log in to their own accounts. Apply the least-privilege model: give permissions on an as-needed basis.
Set prompts with big bold names that indicate production servers and their purpose. Automate repetitive commands that are prone to typos.
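One way to guard against running a destructive command on the wrong server is to make the operator retype the hostname first. Here is a minimal sketch in Python. The function name `confirm_target` and the hostnames in the usage note are hypothetical, not taken from any particular tool.

```python
import socket


def confirm_target(typed_name, actual_host=None):
    """Return True only if the operator typed this machine's hostname exactly."""
    if actual_host is None:
        # Ask the OS which server we are really logged into.
        actual_host = socket.gethostname()
    return typed_name.strip() == actual_host
```

A wrapper script could prompt for the hostname, call `confirm_target(answer)`, and refuse to run the drop or restore step unless it returns True. Typing `db-dev-01` while logged into `db-prod-01` would then stop the command before it does any damage.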
Don’t be afraid to give developers read-only accounts on production servers.
Regular team meetings, à la Agile stand-ups, are a great way to encourage folks to communicate. Bring the developers and operations folks together. Ask everyone in turn to voice their current todos, their concerns and the risks they see. Encourage everyone to listen with an open mind. Consider different perspectives.
Communication is a cultural attribute, so it comes from the top. Encourage it as a CTO or CIO by asking questions, communicating your concerns, and repeating your own requests in different words. Listen to what your team is saying, then repeat and rephrase those concerns, along with how and when they will be addressed.
A culture of documenting services and processes is healthy. It provides a central location and knowledge base for the team. It also prevents sliding into the situation where only one team member understands how to administer critical business components. Were that person to be unavailable or to leave the company, you’re stuck reverse engineering your infrastructure and guessing at architectural decisions.
Rather than think of best practices as something you need to achieve today, think of them as an ongoing, day-to-day quest for improvement.
- repetitive manual processes – employ automation & script those processes where possible.
- where steps require investigation and research – document it
- where production changes are involved – communicate with business units, qa & operations
- always be improving – striving for better practices
In search of a good book on Chef itself, I picked up this new title from O’Reilly. It’s one of their new format books, small in size, only 75 pages.
There was some very good material in this book. Mr. Nelson-Smith’s writing style is good, readable, and informative. The discussion of risks of infrastructure as code was instructive. With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one. So an honest discussion of the risks of such an approach is bold and much needed. I also liked the introduction to Chef itself, and the discussion of installation.
Chef isn’t really the main focus of this book, unfortunately. The book spends a lot of time introducing us to Agile Development, and specifically test driven development. While these are lofty goals, and the first time I’ve seen treatment of the topic in relation to provisioning cloud infrastructure, I did feel too much time was spent on that.
Software is composed of individual components. As developers build these units, they write tests to verify them for correctness. These tests can verify the environment, they can verify data, they can verify edge cases and include test harnesses. In essence they verify that the code meets the design specification.
There are a few key advantages to the unit testing approach:
- Self-Documenting – The tests themselves provide a type of documentation for the system as a whole.
- Advances Refactoring – At a later date you may need to repair, rewrite or refactor portions of code. Previously built unit tests provide a tremendous help to make sure your changes still meet the previous design specification.
- Simplifies Functional Testing – With unit testing as an ongoing concern, the final components will likely perform more reliably, and if not the tests & self-documentation may point to how or why they fail to meet some specification.
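To make that concrete, here is a small sketch using Python’s standard `unittest` module. The `apply_discount` function and its rules are made up for illustration. The point is that the tests state the specification (a typical case, the edge cases, and the rejected input) in a form that can be rerun after every refactor.

```python
import unittest


def apply_discount(price, percent):
    """Return price reduced by percent; reject percentages outside 0-100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_edge_cases(self):
        # 0% leaves the price alone, 100% makes it free.
        self.assertEqual(apply_discount(80.0, 0), 80.0)
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 150)
```

Saved in a file, these run with `python -m unittest`. If a later rewrite of `apply_discount` changes its behavior, one of the three tests fails and points at the part of the specification that was broken.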
Sean Hull Quora Discussion – What is software unit testing?
Amazon Web Services is a division of Amazon the bookseller, but this part of the business is devoted solely to infrastructure and internet servers. These are the building blocks of data centers, the workhorses of the internet. AWS’s cloud computing offering allows a business to set up, or “spin up” in the jargon of cloud computing, new compute resources at will. Need a small single-CPU 32-bit Ubuntu server with two 20G disks attached? It’s one command and 30 seconds away!
As we discussed previously, infrastructure provisioning has evolved dramatically over the past fifteen years, from something that took time and cost a lot to the fast, automatic process it is today with cloud computing. This has brought with it a dramatic culture shift in the way systems administration is done: from a fairly manual process of configuring physical machines and software, one that took weeks to set up new services, to a scriptable and automatable process that can take seconds.
This new realm of cloud computing infrastructure and provisioning is called Infrastructure as a Service or IaaS, and Amazon Web Services is one of the largest providers of such compute resources. They’re not the only ones of course. Others include:
- Rackspace Cloud
Cloud computing is still in its infancy, but is growing quickly. Amazon themselves had a major data center outage in April that we discussed in detail. It sent some hot internet startups into a tailspin!
More discussion of Amazon Web Services on Quora – Sean Hull
IOPs (I/O operations per second) are an attempt to standardize comparison of disk speeds across different environments. When you turn on a computer, everything must be read from disk, but thereafter things are kept in memory. However, applications typically read and write to disk frequently. When you move to enterprise-class applications, especially relational databases, a lot of disk I/O is happening, so the performance of disk resources is crucial.
For a basic single SATA drive that you might have in a server or laptop, you can typically get 30-40 IOPs from it. These numbers vary depending on whether you are talking about random versus sequential reads or writes. Picture the needle on a vinyl record: at a constant rotation speed, the groove passes under it faster near the outer edge than near the center. Something similar is happening with the read/write head inside your hard drive too.
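If you want a rough feel for the number on your own machine, you can count how many small random reads complete in a second. The sketch below is only an approximation, and the function name and defaults are my own. Be aware that the operating system’s page cache will serve repeated reads from memory, so on a small or recently read file the result reflects the cache rather than the raw disk. Purpose-built benchmarking tools bypass the cache for that reason.

```python
import os
import random
import time


def measure_random_read_iops(path, block_size=4096, duration=1.0):
    """Count how many random block-sized reads of `path` complete per second."""
    blocks = os.path.getsize(path) // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")
    ops = 0
    # buffering=0 disables Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        while time.perf_counter() - start < duration:
            f.seek(random.randrange(blocks) * block_size)
            f.read(block_size)
            ops += 1
        elapsed = time.perf_counter() - start
    return ops / elapsed
```

Point it at a file much larger than memory to get closer to what the disk itself can do.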
In the Amazon EC2 environment, there is a lot of variability in performance from EBS. You can stripe across four separate EBS volumes, which will be on four different locations on the underlying RAID array, and you’ll get a big boost in disk I/O. Disk performance also varies across the m1.small, m1.large and m1.xlarge instance types, with the latter getting the lion’s share of network bandwidth and so better disk I/O performance. But in the end your best EBS performance will be in the range of 500-1000 IOPs. That’s not huge by physical hardware standards, so an extremely disk-intensive application will probably not perform well in the Amazon cloud.
Still, economic pressures and the flexibility that cloud infrastructure gives the business continue to push cloud computing adoption, so expect the trend to continue.
Quora discussion – What are IOPs and why are they important?
ETL (extract, transform, load) relates to moving data from external sources into and out of relational databases or data warehouses.
Source systems may store data in an endless variety of formats. Extracting involves getting that data into common files for moving to the destination system. A CSV (comma-separated values) file is so named because each record is stored as one line in the file, with fields separated by commas and often surrounded by quotes as well. In MySQL, the INTO OUTFILE syntax can perform this function. If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and create a .mysql script to then run with the mysql shell. In Oracle you would use the spool command in SQL*Plus, the command-line shell. Spool sends subsequent screen output to a file as well.
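Scripting the extract for many tables can be as simple as generating one statement per table. The sketch below builds such a script in Python. The table names and the `/tmp` output directory are placeholders, and in practice the table list would come from a query against the data dictionary (`information_schema.tables` in MySQL) rather than being typed in.

```python
def build_extract_script(tables, out_dir="/tmp"):
    """Return a MySQL script that dumps each table to its own CSV file."""
    statements = []
    for table in tables:
        statements.append(
            f"SELECT * FROM `{table}`\n"
            f"  INTO OUTFILE '{out_dir}/{table}.csv'\n"
            "  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
            "  LINES TERMINATED BY '\\n';"
        )
    return "\n\n".join(statements) + "\n"


# Hypothetical table names, for illustration only.
print(build_extract_script(["users", "orders"]))
```

Save the output as, say, `extract.mysql` and feed it to the mysql shell. Keep in mind that INTO OUTFILE writes the files on the database server host, and the account running it needs the FILE privilege.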
The transform step involves modifying the extracted data in preparation for moving it into the target database server. It may involve sweeping out blank records, rearranging columns, or breaking files into smaller subsets of data. You might also map values differently. For instance, if one column in the source database was gender with values M/F, you might transform those to the strings “Male” and “Female” if that is more useful for your target database server. Or you might transform those to numerical values; for instance, Male & Female might be 0/1 in your target database.
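Here is what the blank-record sweep and the M/F mapping might look like with Python’s standard `csv` module. This is a sketch under a couple of assumptions: the extracted file has a header row, and it contains a column called `gender` (my name for it; adjust to your own schema).

```python
import csv
import io

GENDER_LABELS = {"M": "Male", "F": "Female"}


def transform_rows(reader, gender_column="gender"):
    """Yield cleaned rows: skip blank records and expand M/F codes."""
    for row in reader:
        if not any((value or "").strip() for value in row.values()):
            continue  # sweep out blank records
        code = (row.get(gender_column) or "").strip().upper()
        # Unknown codes are passed through unchanged.
        row[gender_column] = GENDER_LABELS.get(code, row.get(gender_column))
        yield row


def transform_csv(source_text, gender_column="gender"):
    """Transform CSV text (with a header row) and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(transform_rows(reader, gender_column))
    return out.getvalue()
```

Swapping the values in `GENDER_LABELS` for 0 and 1 gives the numeric variant without touching the rest of the code.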
Although a myriad of high-level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experienced system administrator is familiar with. Those include grep & sed, which operate on regular expressions and can perform data transformation at lightning speed. Then there is sort, which can sort data and send the results to stdout or the file of your choosing. Other tools include wc (word count), cut, which can extract or drop columns, and so forth.
The final step, load, involves moving the data into the database server and its final target tables. For instance, in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat-file data loader.
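To show the shape of a load without needing a database server, the sketch below uses Python’s built-in `sqlite3` module as a stand-in target. It is not MySQL or Oracle, and a real bulk loader such as LOAD DATA INFILE or SQL*Loader will be far faster. The idea it illustrates is the same, though: read the flat file, map its header to columns, and insert the whole batch in one transaction. The function name is my own.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Bulk-insert CSV text (with a header row) into an existing table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    sql = f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})'
    rows = (row for row in reader if row)  # ignore empty lines
    with conn:  # one transaction for the whole batch
        cursor = conn.executemany(sql, rows)
    return cursor.rowcount
```

Because the table and column names are interpolated into the SQL, only use this with files and table names you control.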
Quora discussion by Sean Hull – What is ETL?
Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything. That’s why automated monitoring is so important.
So what should you monitor? You can divide up your monitoring into a couple of strategic areas. Just as with metrics collection, there is business & application level monitoring and then there is lower level system monitoring which is also important.
Business & Application Monitoring
- If a user is getting an error page or cannot connect
- If an e-commerce transaction is failing
- General service outages
- If a business goal is met – or not
- Page timeouts or slowness
Systems Level Monitoring
- Backups completed successfully
- Error logs from database, webserver & other major services like email
- Database replication is running
- Webserver timeouts
- Database timeouts
- Replication failures – via error logs & checksum checks
- Memory, CPU, Disk I/O, Server load average
- Network latency
- Network security
Tools that can perform this type of monitoring include Nagios,
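A couple of the system-level checks above (server load and disk usage) can be sketched in a few lines of Python. The thresholds here are arbitrary assumptions of mine, not recommendations, and a real monitoring system adds scheduling, alert routing and history on top of checks like these.

```python
import os
import shutil


def check_system(load_1min, disk_used_fraction, cpu_count,
                 max_load_per_cpu=1.5, max_disk_fraction=0.9):
    """Return a list of alert strings; an empty list means all checks passed."""
    alerts = []
    if load_1min > max_load_per_cpu * cpu_count:
        alerts.append(f"load average {load_1min:.2f} exceeds limit for {cpu_count} CPUs")
    if disk_used_fraction > max_disk_fraction:
        alerts.append(f"disk {disk_used_fraction:.0%} full")
    return alerts


def sample_and_check(path="/"):
    """Sample this machine's load and disk usage, then apply the checks."""
    load_1min = os.getloadavg()[0]  # available on Unix-like systems only
    usage = shutil.disk_usage(path)
    return check_system(load_1min, usage.used / usage.total, os.cpu_count() or 1)
```

Keeping the threshold logic in `check_system` separate from the sampling makes it easy to test with made-up numbers.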
Quora discussion – Web Operations Monitoring
Web-facing database servers receive a barrage of activity 24 hours a day. Sessions are managed for users logging in, ratings are clicked and comments are added. Even more complex are web-based e-commerce applications. All of this activity is organized into small chunks called transactions. They are discrete sets of changes. If you’re editing a word processing document, it might autosave every five minutes. If you’re doing something in Excel it may provide a similar feature. There is also a built-in mechanism for undo and redo of recent edits you have made. These are all analogous to transactions in a database.
These are important because all of these transactions are written to logfiles. They make replication possible, by replaying those changes on another database server downstream.
If you have lost your database server because of hardware failure or instance failure in EC2, you’ll be faced with the challenge of restoring your database server. How is this accomplished? Well, the first step would be to restore from the last full backup you have, perhaps a full database dump that you perform every day late at night. Great, now you’ve restored to 2am. How do you get the rest of your data?
That is where point-in-time recovery comes in. Since those transactions were being written to your transaction logs, all the changes made to your database since the last full backup must be reapplied. In MySQL this transaction log is called the binlog, and there is a mysqlbinlog utility that reads the transaction log files and prints the statements they contain, which you replay by feeding them to the mysql client. You’ll tell it the start time, in this case 2am when the backup happened. And you’ll tell it the end time, which is the point in time you want to recover to. That time will likely be the time you lost your database server hardware.
Point-in-time recovery is crucial to high availability, so be sure to back up your binlogs right alongside the full database backups that you take every night. If you lose the server or disk that the database is hosted on, you’ll want an alternate copy of those binlogs available for recovery!
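The start and end times map onto mysqlbinlog’s `--start-datetime` and `--stop-datetime` options. The sketch below builds that command line in Python. The binlog file names, dates and the `root` account are hypothetical examples, so substitute your own.

```python
import shlex
from datetime import datetime


def binlog_replay_command(binlogs, backup_time, recover_to):
    """Build the mysqlbinlog argument list for a point-in-time replay."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return [
        "mysqlbinlog",
        f"--start-datetime={backup_time.strftime(fmt)}",
        f"--stop-datetime={recover_to.strftime(fmt)}",
        *binlogs,
    ]


# Hypothetical example: backup taken at 2am, server lost at 3:45pm the same day.
command = binlog_replay_command(
    ["mysql-bin.000042", "mysql-bin.000043"],
    datetime(2011, 6, 1, 2, 0, 0),
    datetime(2011, 6, 1, 15, 45, 0),
)
# Piping into the mysql client is what actually applies the statements.
print(shlex.join(command) + " | mysql -u root -p")
```

List the binlog files in order, oldest first, so the changes are applied in the sequence they originally happened.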
Quora discussion on Point-in-time Recovery by Sean Hull
A data warehouse is a special type of database. It is used to store large amounts of data, such as analytics, historical, or customer data, and then to run large reports and data mining against it. It is markedly different from a web-facing or high-transaction database, which typically has a great many small transactions or pieces of data that are constantly changing, through hundreds or thousands of small user sessions. These typically execute in times on the order of 1/100th of a second, while in a data warehouse you have fewer, larger queries which can take minutes to execute.
Data warehouses are tuned for updates happening in bulk via batch jobs, and for large queries which need big chunks of memory to sort and cross-tabulate data from different tables. Often full table scans are required because of the specialized one-off nature of these reports. The same queries are not executed over and over.
It’s important not to mix data warehousing databases with transactional databases in the same instance, whether you are dealing with MySQL or Oracle. That’s because they are tuned totally differently. It would be like trying to use the same engine for a car commuting to work and a container ship traveling around the world. Different jobs require different databases, or databases with their dials set for different uses.
Quora discussion of data warehousing – Sean Hull
A lot of technical forums and discussions have highlighted the limitations of EC2 and how it loses on performance when compared to physical servers of equal cost. They argue that you can get much more hardware and bigger iron for the same money. So it then seems foolhardy to turn to the cloud. Why this mad rush to the cloud then? Of course if all you’re looking at is performance, it might seem odd indeed. But another way of looking at it is, if performance is not as good, it’s clearly not the driving factor in cloud adoption.
CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”
Another question: “Is it a good idea to deploy your database in the cloud?” It depends! Let’s take a look at some of the strengths and weaknesses, then you decide.
8 big strengths of the cloud
- Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
- Easier roll out of patches and upgrades
- Reduced operational headache – scripting and automation becomes central
- Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
- Low initial investment
- Auto-scaling – set thresholds and deploy new capacity automatically
- Easy compromise response – take the server offline and spin up a new one
- Easy setup of dev, qa & test environments
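The auto-scaling item in the list above boils down to a threshold rule. Here is a deliberately simple sketch of one in Python. The thresholds, limits and the choice of average CPU as the signal are all assumptions for illustration, and a cloud provider’s own auto-scaling service handles this for you with more safeguards, such as cool-down periods.

```python
def desired_servers(current, avg_cpu, scale_up_at=0.75, scale_down_at=0.25,
                    minimum=1, maximum=10):
    """Return the server count a simple threshold rule would ask for."""
    if avg_cpu > scale_up_at:
        target = current + 1   # busy: add capacity
    elif avg_cpu < scale_down_at:
        target = current - 1   # idle: shed capacity
    else:
        target = current       # within the comfortable band
    # Never go below the floor or above the ceiling.
    return max(minimum, min(maximum, target))
```

The gap between the two thresholds matters: if they sit too close together, the rule will add and remove servers repeatedly as load hovers around the boundary.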
Some challenges with deploying in the cloud
- Big cultural shift in how operations is done
- Lower SLAs and less reliable virtual servers – mitigate with automation
- No perimeter security – new model for managing & locking down servers
- Where is my data? — concerns over compliance and privacy
- Variable disk performance – can be problematic for MySQL databases
- New procurement process can be a hurdle
Many of these challenges can be mitigated. The promise of the infrastructure deployed in the cloud is huge, so digging our heels in with gradual adoption is perhaps the best option for many firms. Mitigate the weaknesses of the cloud with these steps:
- Use encrypted filesystems and backups where necessary
- Also keep offsite backups inhouse or at an alternate cloud provider
- Mitigate variable EBS performance – cache at every layer of your application stack
- Employ configuration management & automation tools such as Puppet & Chef
Quora discussion – Why or why not to migrate to the cloud?