Tag Archives: hiring

Road War Story – Hacking Inflight Solutions

 

The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there three months previous, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.

 

 

12 hours to liftoff

I was scrambling to pack my luggage to go on summer vacation.  I was bound for SF at the moment and my flight was leaving in the morning.  I was trying to wrap up loose ends and my former boss was entreating me – “Can you help us?  Our replication setup has just melted down.  We need you to cleanup the mess.”

The so-called pain point

After a few more early am Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases.  (For the curious, here is how) I find countless records different between the two databases.

Now my flight is boarding, so I pack up the laptop and find my seat.  As soon as the seat belt lights flash off, I’m flipping open my macbook at getting inflight wifi working.  Through the flight I’m on SKYPE with the team, with command line terminals open to the servers.  Discuss, debug, troubleshoot – rinse, repeat.

From there I write up a report and explain to the team & CTO the problem.  Syncing that many different records is too risky.  We’d have to review all the statements one-by-one.  I’d rather rebuild replication from scratch.

From there the CTO gives the go ahead, and with the help of Percona’s xtrabackup to do online hotbackups, we are able to fix replication without downtime. Amen to that!

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot.  Traffic returns to a smooth predictable 2 million pageviews per day.  That’s smooth and predictable on a site that gets 50 million a month!   The database loads are calm and steady, as our all of our nerves.   In the coming days we continue to monitor the situation, and write up lengthly root cause analysis of the situation.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial
  • Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of your team
  • Everyone knows eachother virtual or in real life, and are comfortable with the parts they play.  A team that can work together is crucial, whether it’s all fulltime folks, some consultants, some outsourced or wherever they may be.  Each has a role to play, and communication and team work brings it all together.

  • Have laptop will travel
  • I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things
  • If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t login to the systems you’re working on regularly, you better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

    CTOs and Directors of Operations take note

  • Small & Nimble wins the day
  • I’ve used this value proposition before when speaking to prospects.  You can hire a big firm, and be a small fish to them.  Small fish means you’re gonna get less attention.  OR you can hire a small firm or contractor.  Then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk averse operations folks
  • In developers you’re building technology, features, and forging ahead into new solutions.  The role is more to create waves, and break barriers.  How can we enable new business processes and so forth?

    In hiring operations personnel you want stability.  Look for individuals who are more risk averse.  This conservative streak is a countering force.  Ops teams are tasked with that job of bringing a steady state to your business services.  They don’t want to wake up at 2am in the morning.

Best of Guide – Highlights of Our Popular Content

We cherry pick the top 5 most popular posts of various topics we’ve covered in recent months.

Top MySQL DBA interview questions (Part 1)

MySQL DBA interview questions

Also find Sean Hull’s ramblings on twitter @hullsean.

MySQL DBAs are in greater demand now than they’ve ever been. While some firms are losing the fight for talent, promising startups with a progressive bent are getting first dibs with the best applicants. Whatever the case, interviewing for a MySQL DBA is a skill in itself so I thought I’d share a guide of top MySQL DBA interview questions to help with your screening process.
It’s long and detailed with some background to give context so I will be publishing this in two parts.

The history of the DBA as a career

In the Oracle world of enterprise applications, the DBA has long been a strong career path. Companies building their sales staff required Peoplesoft or SAP, and those deploying the financial applications or e-business suite needed operations teams to manage those systems. At the heart of that operations team were database administrators or DBAs, a catchall title that included the responsibility of guarding your businesses crown jewels. Security of those data assets, backups, management and performance were all entrusted to the DBA.

In the world of web applications, things have evolved a bit differently. Many a startup are driven only by developers. In those smaller shops, Operations tasks are designated to one developer who takes on the additional responsibility of managing systems. In that scenario, Operations or DBA duties becomes a sort of secondary role to the primary one of building the application. Even in cases where the startup creates a specific operations role with one person managing systems administration, chances are they don’t also have DBA experience. Instead, these startups are more likely to manage the database as a typical Linux application.

When I grow up I (don’t) want to be a MySQL DBA

Where do they come from, and why don’t a lot of computer science folks gravitate towards operations, and DBA? This may be in part due to and the romance of certain job roles which we discussed in a past article, The Mythical MySQL DBA. This pattern appeared a lot in the Oracle world as well. Many folks who were career DBAs actually moved to that role from the business side. In fact you’d find that many didn’t have a computer science or engineering background in the first place. In my experience I saw many Linux and Unix administrators with a stronger foundation who would fit into the DBA role but were simply not interested in it. The same can be said of the MySQL side of the house. Computer science grads don’t get out of school aiming for a career in ops or as a DBA because it has never been regarded as the pinnacle. It’s typically the PROGRAMMERS who become the rockstars in a cool startup.

But as the Internet grows into a richer and more complex medium, things are changing. People talk about scalability, high availability, zero downtime and performance tuning. When brief outages cost millions in losses expectations are very high and that requires skilled, experienced DBAs.

We’ve made a list of comprised of skill questions, general questions and ‘good-to-know’ questions. Have fun grilling your candidate with them, although bear in mind that with interviews it’s not about knowing it all, rather how the person demonstrates critical thinking skills.

Skills Questions

  1. Why are SQL queries so fundamental to database performance?
  2. This is the one question which a DBA should have an answer to. If they can’t answer this question, they’re unlikely to be a good candidate.

    After a MySQL server is setup and running, with many of the switches and dials set to use memory, and play well with other services on the Linux server, queries remain an everyday challenge. Why is this?

    SQL queries are like little programs in and of themselves. They ask the database server to collect selections of records, cross tabulate them with other collections of records, then sort them, and slice and dice them. All of this requires MySQL to build temporary tables, perform resource intensive sorts and then organize the output in nice bite size chunks.

    Unfortunately there are many ways to get the syntax and the results right, yet not do so efficiently. This might sound like a moot point, but with modern websites you may have 5000 concurrent users on your site, each hitting pages that have multiple queries inside them.

    What makes this an ongoing challenge is that websites are typically a moving target, with business requirements pushing new code changes all the time. New code means new queries, which pose ongoing risks to application stability.

  3. Indexes – too many, too few; what’s the difference?
  4. Indexes are very important to the smooth functioning of a relational database. Imagine your telephone book of yore. I can look up all the people with last name of “Hull” in Manhattan because I have the proper index. But most yellow pages don’t include an index for *first* names even though they might occaisionally come in handy, for example with the names “Star” or “Persephone”.

    You can imagine that, if you had a phone book which you maintain and update, everytime you add or remove a name you also have to update the index. That’s right, and the same goes for your relational database.

    So therein lies the trade off, and it’s an important one. When you are *modifying* your data, adding, updating or removing records, you must do work to keep the index up to date. More indexes mean more work. However when you’re looking up data or *querying* in SQL speak, more indexes mean more ways of looking up data fast. One more trade off is that indexes take up more pages in your phonebook, and so too they take up more space on disk.

  5. Backup & Recovery – explain various types & scenarios for restore
  6. Backups come in a few different flavors that the DBA should be familiar with.

    Cold backups involve shutdown down the database server (mysqld) and then backing up all the data files by making a copy of them to another directory. To be really thorough, the entire datadir including binlogs, log files, /etc/my.cnf config file should also be backed up. The cold backup is a database in itself, and can be copied to an alternate server and mounted as-is.

    Logical backups involve using the mysqldump tool. This locks tables while it runs to maintain consistency of changing data, and can cause downtime. The resulting dump file contains CREATE DATABASE, CREATE TABLE & CREATE INDEX statements to rebuild the database. Note the file itself is not a database, but rather a set of instructions which can tell a MySQL server *HOW* to reconstruct the database. Important distinction here.

    Hot backups are a great addition to the mix as they allow the physical database data files to be backed up *WHILE* the server is up and running. In MySQL this can be achieved with the xtrabackup tool, available from Percona. Despite the name, it works very well with MyISAM and InnoDB tables too, so don’t worry if you’re not using xtradb tables.

    There are a few different restore scenarios, and the candidate should be able to describe how these various backups can be restored, and what the steps to do so would be. In addition they should understand what point-in-time recovery is, and how to perform that as well. After restoring one of the above three backup types, the DBA would use the mysqlbinlog utility to apply any subsequent transactions from the binary logs. So if the backup was made at 2am last night, and you restore that backup, the mysqlbinlog tool would be used to dig up transactions since 2am, and apply them to that restored database.

  7. Troubleshooting Performance
  8. Since this is an ongoing challenge with relational databases, a good grasp of it is crucial. One way to challenge the candidate would be to describe a recent performance problem you experienced with your infrastructure, and ask them how they would go about resolving it.

    If they struggle with the particulars of what you ran into, ask them to describe a big performance challenge they solved, what the cause was, and how they performed analysis.

    Typically, first steps involve mitigating the immediate problem by finding out what changed in the environment either operationally or code changes. If there is a bug that was hit, or other strange performance anomaly, the first stop is usually looking at log files. MySQL server error logs, and the slow query log are key files. From there, analyzing those files during the timeframe where problems occurred should yield some clues.

    You might also hope to hear some comment about metrics collection in this discussion. Tools such as cacti, munin, opennms, or ganglia are invaluable tools for drilling down on a past event or outage, and sifting through server stats to find trouble.

  9. Joins – describe a few kinds and how the server performs them
  10. A basic understanding of INNER JOIN and OUTER JOIN would be a great start. A simple example might be employees and departments. If you have four employees and two departments, an INNER JOIN of these tables together will give you the departments employees belong to. Add another employee without assigning her to a department, and the inner join won’t display her. Further adding a new department which doesn’t yet contain employees won’t display either. However performing an OUTER JOIN will give you those matches with null in the department field, and null in the employee field respectively.

    Thought of with another example, take a credit card company. One tables contains cardholders identity, their number, address, and other personal information. A second table contains their account activity. When they first join, they don’t have any monthly statements, so an INNER JOIN of cardholders with statements will yield no rows. However an OUTER JOIN on those two tables will yield a record, with a null for the statements columns.

Feeling like a MySQL expert yet? In Part 2 of Top MySQL DBA Interview Questions we’ll walkthrough four more questions plus a bonus.

How to hire a developer that doesn't suck

xkcd_goodcode
Strip by Randall Munroe; xkcd.com

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk averse operations team.

By the way we also have a MySQL DBA Interview Questions article which is quite popular.

Also take a look at our AWS & EC2 Interview questions piece.

Lastly we have a great Oracle DBA Hiring Guide.

It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing starts.

If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?

Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability. Continue reading How to hire a developer that doesn't suck

8 Questions to ask an AWS Expert

via GIPHY

If you’re headhunting a cloud computing expert, specifically someone who knows Amazon Web Services (AWS) and EC2, you’ll want to have a battery of questions to ask them to assess their knowledge.  As with any technical interview focus on concepts and big picture.  As the 37Signals folks like to say “hire for attitude, train for skill”.  Absolutely!

New: Top questions for hiring a serverless lambda expert

Also new: Top questions to ask on a devops expert interview

And: How to hire a developer that doesn’t suck

If you want more general info about Amazon Web Services, read our Intro to EC2 Deployments.

    1. Explain Elastic Block Storage?  What type of performance can you expect?  How do you back it up?  How do you improve performance?

    EBS is a virtualized SAN or storage area network.  That means it is RAID storage to start with so it’s redundant and fault tolerant.  If disks die in that RAID you don’t lose data.  Great!  It is also virtualized, so you can provision and allocate storage, and attach it to your server with various API calls.  No calling the storage expert and asking him or her to run specialized commands from the hardware vendor.

    Performance on EBS can exhibit variability.  That is it can go above the SLA performance level, then drop below it.  The SLA provides you with an average disk I/O rate you can expect.  This can frustrate some folks especially performance experts who expect reliable and consistent disk throughput on a server.  Traditional physically hosted servers behave that way.  Virtual AWS instances do not.

    Related: Is Amazon too big to fail?

    Backup EBS volumes by using the snapshot facility via API call or via a GUI interface like elasticfox.

    Improve performance by using Linux software raid and striping across four volumes.

    2. What is S3?  What is it used for? Should encryption be used?

    S3 stands for Simple Storage Service.  You can think of it like ftp storage, where you can move files to and from there, but not mount it like a filesystem.  AWS automatically puts your snapshots there, as well as AMIs there.  Encryption should be considered for sensitive data, as S3 is a proprietary technology developed by Amazon themselves, and as yet unproven vis-a-vis a security standpoint.

    3. What is an AMI?  How do I build one?

    AMI stands for Amazon Machine Image.  It is effectively a snapshot of the root filesystem.  Commodity hardware servers have a bios that points the the master boot record of the first block on a disk.  A disk image though can sit anywhere physically on a disk, so Linux can boot from an arbitrary location on the EBS storage network.

    Need an AWS expert? Email me for a quote hullsean @ gmail.com

    Build a new AMI by first spinning up and instance from a trusted AMI.  Then adding packages and components as required.  Be wary of putting sensitive data onto an AMI.  For instance your access credentials should be added to an instance after spinup.  With a database, mount an outside volume that holds your MySQL data after spinup as well.

    4. Can I vertically scale an Amazon instance? How?

    Yes.  This is an incredible feature of AWS and cloud virtualization.  Spinup a new larger instance than the one you are currently running.  Pause that instance and detach the root ebs volume from this server and discard.  Then stop your live instance, detach its root volume.  Note the unique device ID and attach that root volume to your new server.   And the start it again.  Voila you have scaled vertically in-place!!

    5. What is auto-scaling? How does it work?

    Autoscaling is a feature of AWS which allows you to configure and automatically provision and spinup new instances without the need for your intervention.  You do this by setting thresholds and metrics to monitor.  When those thresholds are crossed a new instance of your choosing will be spun up, configured, and rolled into the load balancer pool.  Voila you’ve scaled horizontally without any operator intervention!

    Also: Are we fast approaching cloud-mageddon?

    With MySQL databases autoscaling can get a little dicey, so we wrote a guide to autoscaling MySQL on amazon EC2.

    6. What automation tools can I use to spinup servers?

    The most obvious way is to roll-your-own scripts, and use the AWS API tools.  Such scripts could be written in bash, python or another language or your choice.  Next option is to use a configuration management and provisioning tool like puppet or better it’s successor Opscode Chef. Ansible is also an excellent option because it doesn’t require an agent, and can run your shell scripts as-is.  You might also look towards CloudFormation or Terraform. The resulting code captures your entire infrastructure, can be checked into your git repository & version controlled. You can even unit test this way! 

    7. What is configuration management?  Why would I want to use it with cloud provisioning of resources?

    Configuration management has been around for a long time in web operations and systems administration.  Yet the cultural popularity of it has been limited.  Most systems administrators configure machines as software was developed before version control – that is manually making changes on servers.  Each server can then and usually is slightly different.  Troubleshooting though is straightforward as you login to the box and operate on it directly.  Configuration management brings a large automation tool into the picture, managing servers like strings of a puppet.  This forces standardization, best practices, and reproducibility as all configs are versioned and managed.  It also introduces a new way of working which is the biggest hurdle to its adoption.

    Read: When hosting data on Amazon turns bloodsport

    Enter the cloud, and configuration management becomes even more critical.  That’s because virtual servers such as amazons EC2 instances are much less reliable than physical ones.  You absolutely need a mechanism to rebuild them as-is at any moment.  This pushes best practices like automation, reproducibility and disaster recovery into center stage.

    While on the subject of configuration management take a quick peek at hiring a devops guide.

    8. Explain how you would simulate perimeter security using Amazon Web Services model?

    Traditional perimeter security that we’re already familiar with using firewalls and so forth is not supported in the Amazon EC2 world.  AWS supports security groups.  One can create a security group for a jump box with ssh access – only port 22 open.  From there a webserver group and database group are created.  The webserver group allows 80 and 443 from the world, but port 22 *only* from the jump box group.  Further the database group allows port 3306 from the webserver group and port 22 from the jump box group.  Add any machines to the webserver group and they can all hit the database.  No one from the world can, and no one can directly ssh to any of your boxes.

    The more full featured way to go is VPC. That’s Amazon’s acronym for virtual private cloud. You can create virtual networks both private & public, with subnets etc all within VPCs. You then spinup servers & resources inside those virtual networks. VPCs can be control with security groups or the more powerful but messy access control lists.

    Also: A history lesson for cloud detractors – January 2012

    Want to further lock this configuration down?  Only allow ssh access from specific IP addresses on your network, or allow just your subnet.

Did you make it this far?!?! Grab our newsletter.

Oracle DBA Interview Questions

Oracle Database Administrator or often called DBAs are an indispensable part of your operations team. They manage the systems that house all your business data, your customers, products, transactions and all that analytical data on what customers are actually doing. If you’ve ever been on the hunt, you may wonder, why the shortage of DBAs? To that we’ll answer, have you ever heard of Dustin Moskovitz?

So you certainly want to entrust that to someone who knows what they’re talking about. Enter the Oracle DBA Interview, a process that some will see as a technical test, while others will see as a fit of personalities, behaviors, and work ethic.

From the technical side we thought we’d bring you a quick and dirty checklist of questions. This isn’t an exhaustive list by any means, but is a good place to start and will certainly provide you with a glimpse of their knowledge.

Also if you’re looking to hire a MySQL DBA here’s a guide, and also one for hiring and EC2 expert.

1. What is the difference between RMAN and a traditional hotbackup?

RMAN is faster, can do incremental (changes only) backups, and does not place tablespaces into hotbackup mode.

2. What are bind variables and why are they important?

With bind variables in SQL, Oracle can cache related queries a single time in the SQL cache (area). This avoids a hard parse each time, which saves on various locking and latching resources we use to check objects existence and so on. BONUS: For rarely run queries, especially BATCH queries, we explicitely DO NOT want to use bind variables, as they hide information from the Cost Based Opitmizer.

BONUS BONUS: For batch queries from 3rd party apps like peoplesoft, if we can’t remove bind variables, we can use bind variable peeking!

3. In PL/SQL, what is bulk binding, and when/how would it help performance?

Oracle’s SQL and PL/SQL engines are separate parts of the kernel which require context switching, like between unix processes. This is slow, and uses up resources. If we loop on an SQL statement, we are implicitely flipping between these two engines. We can minimize this by loading our data into an array, and using PL/SQL bulk binding operation to do it all in one go!

4. Why is SQL*Loader direct path so fast?

SQL*Loader with direct path option can load data ABOVE the high water mark of a table, and DIRECTLY into the datafiles, without going through the SQL engine at all. This avoids all the locking, latching, and so on, and doesn’t impact the db (except possibly the I/O subsystem) at all.

5. What are the tradeoffs between many vs few indexes? When would you want to have many, and when would it be better to have fewer?

Fewer indexes on a table mean faster inserts/updates. More indexes mean faster, more specific WHERE clauses possibly without index merges.

6. What is the difference between RAID 5 and RAID 10? Which is better for Oracle?

RAID 5 is striping with an extra disk for parity. If we lose a disk we can reconstruct from that parity disk.

RAID 10 is mirroring pairs of disks, and then striping across those sets.

RAID 5 was created when disks were expensive. Its purpose was to provide RAID on the cheap. If a disk fails, the IO subsystem will perform VERY slowly during the rebuild process. What’s more your liklihood of failure increases dramatically during this period, with all the added weight of the rebuild. Even when it is operating normally RAID 5 is slow for everything but reading. Given that and knowing databases (especially Oracle’s redo logs) continue to experience write activity all the time, we should avoid RAID5 in all but the rare database that is MOSTLY read activity. Don’t put redologs on RAID5.

RAID10 is just all around goodness. If you lose one disk in a set of 10 for example, you could lose any one of eight other disks and have no troubles. What’s more rebuilding does not impact performance at all since you’re simply making a mirror copy. Lastly RAID10 perform exceedingly well in all types of databases.

7. When using Oracle export/import what character set concerns might come up? How do you handle them?

Be sure to set NLS_LANG for example to “AMERCIAN_AMERICA.WE8ISO8859P1”. If your source database is US7ASCII, beware of 8-bit characters. Also be wary of multi-byte characters sets as those may require extra attention. Also watch export/import for messages about any “character set conversions” which may occur.

8. How do you use automatic PGA memory management with Oracle 9i and above?

Set the WORKAREA_SIZE_POLICY parameter to AUTO and set PGA_AGGREGATE_TARGET

9. Explain two easy SQL optimizations.

a. EXISTS can be better than IN under various conditions

b. UNION ALL is faster than UNION (not sorting)

10. Name three SQL operations that perform a SORT.

a. CREATE INDEX

b. DISTINCT

c. GROUP BY

d. ORDER BY

f. INTERSECT

g. MINUS

h. UNION

i. UNINDEXED TABLE JOIN

11. What is your favorite tool for day-to-day Oracle operation?

Hopefully we hear some use of command line as the answer!

12. What is the difference between Truncate and Delete? Why is one faster?

Can we ROLLBACK both? How would a full table scan behave after?

Truncate is nearly instantaenous, cannot be rolled back, and is fast because Oracle simply resets the HWM. When a full table scan is performed on a table, such as for a sort operation, Oracle reads to the HWM. So if you delete every single solitary row in 10 million row table so it is now empty, sorting on that table of 0 rows would still be extremely slow.

13. What is the difference between a materialized view (snapshot) fast refresh versus complete refresh? When is one better, and when the other?

Fast refresh maintains a change log table, which records change vectors, not unlike how the redo logs work. There is overhead to this, as with a table that has a LOT of indexes on it, and inserts and updates will be slower. However if you are performing refreshes often, like every few minutes, you want to do fast refresh so you don’t have to full-table-scan the source table. Complete refresh is good if you’re going to refresh once a day. Does a full table scan on the source table, and recreats the snapshot/mview. Also inserts/updates on the source table are NOT impacted on tables where complete refresh snapshots have been created.

14. What does the NO LOGGING option do? Why would we use it? Why would we be careful of using it?

It disables the logging of changes to the redologs. It does not disable ALL LOGGING, however as Oracle continues to use a base of changes, for recovery if you pull the plug on the box, for instance. However it will cause problems if you are using standby database. Use it to speed up operations, like an index rebuild, or partition maintenance operations.

15. Tell me about standby database? What are some of the configurations of it? What should we watch out for?

Standby databases allow us to create a copy of our production db, for disaster recovery. We merely switch mode on the target db, and bring it up as read/write. Can setup as master->slave or master->master. The latter allows the former prod db to become the standby, once the failure cause is remedied. Watch out for NO LOGGING!! Be sure we’re in archivelog mode.

Hey you! If you made it this far, definitely grab our newsletter.