Tag Archives: oracle

How to interview an amazon database expert

Amazon releases a new database offering every other day. It sure isn’t easy to keep up.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Let’s say you’re hiring a devops engineer & you want to suss out their database knowledge. Or you’re hiring a professional services firm or freelance consultant. Whatever the case, you’ll need to sift through candidates to find the best people. Here’s how.

Also: How to interview an AWS expert

What database does Amazon support for caching?

Caching is a popular way to speed up access to your backend database. Put Amazon’s ElastiCache behind your webserver, and you can reduce load on your database by as much as 90%. Nice!

The two engines that Amazon supports are Memcached & Redis. Memcached is historically more popular. These days Redis seems a clear winner. It’s often faster, and can persist your cached data between restarts. That will save you, I promise!

Also: Is AWS too complex for small dev teams?

How can I store big data in AWS?

Amazon’s data warehouse offering is called Redshift. I wrote about it in Why is everyone suddenly talking about Redshift? Why indeed!

When you’re doing large reports for your business intelligence team, you don’t want to bog down your backend relational database. Redshift is purpose built for this use case.

I’ve seen a report that took over 8 hours in MySQL return in under 60 seconds in Redshift!

A new offering is Redshift Spectrum. This tech is super cool. Load up all your data into S3, in standard CSV format. Then without even loading it into Redshift, you can query the S3 data directly. This is super useful. Firstly because S3 storage is roughly 1/10th the price. But also because it allows you to stage your data before loading into Redshift itself. Goodbye Google BigQuery! I talked about Spectrum here.

Related: Which engineering roles are in greatest demand?

What relational database options are there on Amazon?

Amazon supports a number of options through its Relational Database Service or RDS. These are managed databases, which means less work on your DBA’s shoulders. It also may make upgrades slower and harder with more downtime, but you get what you pay for.

There are a lot of platforms available. As you might guess MySQL & Postgres are there. Great! Even better you can use MariaDB if that’s your favorite. You can also go with Aurora which is Amazon’s own home-brew drop in replacement for MySQL that promises greater durability and some speedups.

If you’re a glutton for punishment, you can even get Oracle & SQL Server working on RDS. Very nice!

Read: Can on-demand consulting save startups time & money?

Does AWS have a NoSQL database solution?

If NoSQL is to your taste, Amazon has DynamoDB. According to . I haven’t seen a lot of large production applications using it, but what he describes makes a lot of sense. The way Amazon scales nodes & data I/O is bound to run into real performance problems.

That said it can be a great way to get you up and running quickly.

How do I do ETL & migrate data to AWS?

Let’s be honest, Amazon wants to make this really easy. The quicker & simpler it is to get your data there, the more you’ll buy!

Amazon’s Database Migration Service or DMS allows you to configure your old database as a data source, then choose an Amazon database as the destination, then just turn on the spigot and pump your data in!

ETL is extract, transform and load: data warehouse terminology for slicing and dicing data before you load it into your warehouse. Many of today’s warehouses are being built with the data lake model, because databases like Redshift have gotten so damn fast. That model means you stage all your source data as-is in your warehouse, then build views & summary tables as needed to speed up queries & reports. Even better, you might look at a tool like Xplenty.
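
Here is a small sketch of that stage-first approach. It uses Python’s built-in sqlite3 as a stand-in for the warehouse, so the SQL is generic rather than Redshift-specific, and the table names (`raw_orders`, `orders`, `customer_totals`) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stage the source data as-is: no cleanup, everything stored as text.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        ("1", "acme", "100.50", "2017-01-03"),
        ("2", "acme", "20.00", "2017-01-15"),
        ("3", "globex", "75.25", "2017-02-01"),
    ],
)

# Transform later, inside the warehouse: a view casts types on read...
conn.execute(
    """CREATE VIEW orders AS
       SELECT CAST(order_id AS INTEGER) AS order_id,
              customer,
              CAST(amount AS REAL) AS amount,
              order_date
       FROM raw_orders"""
)

# ...and a summary table precomputes what the reports ask for most.
conn.execute(
    """CREATE TABLE customer_totals AS
       SELECT customer, SUM(amount) AS total
       FROM orders
       GROUP BY customer"""
)

totals = dict(conn.execute("SELECT customer, total FROM customer_totals"))
print(totals)
```

The raw table is never modified, so you can always rebuild the view or the summary table when the reporting needs change.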

Amazon’s new offering is called Glue. See Five ways to get data into Amazon Redshift. This solution is purpose-built for creating a powerful data pipeline, complete with Python code to do transformations.

Read: Is data your dirty little secret?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to build an operational datastore on AWS with S3 & Redshift

You’re building your data warehouse, and getting data into Redshift. You’ve got your ETL pipeline running, and presentation layer talking to the warehouse. Great.

But how to get access to that source data? Wouldn’t it be nice if that was close by too?

It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the Redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Back in the 90’s Oracle pioneered this work, allowing you to essentially map a CSV file, that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.

Also: Top serverless interview questions for hiring aws lambda experts

Give Redshift permissions

Go into IAM and create a new role called “SeanSpectrumRole”. Assign the managed policy AmazonS3ReadOnlyAccess. It looks like this:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}

If you’re using the dashboard you just pick the policy from the named list. However if you’re using CloudFormation, you’ll use the code above.

Now navigate your aws console to the Redshift dashboard, click clusters, and click the checkbox for your cluster. Probably there’s only one.

Now click the “Manage IAM Roles” button, and a dialog should pop up. Select the role you created earlier, SeanSpectrumRole. Then click “Apply Changes”.

The beauty of the AWS world is that servers themselves can have API permissions. In this case we gave the redshift cluster or server itself, access to S3 for our use below!

Create your spectrum schema

First you must create a spectrum schema. Here’s the syntax:


create external schema spectrum_schema
from data catalog
database 'sean'
region 'us-east-1'
iam_role 'arn:aws:iam::9999999999999:role/SeanSpectrumRole';

Upload your data to S3 bucket

Here we create an s3 bucket called sean_spectrum, then upload one csv file named sean_numbers.txt.


$ aws s3api create-bucket --bucket sean_spectrum --region us-east-1
{
"Location": "/sean_spectrum"
}
$ cd spectrum/
$ cat sean_numbers.txt
21,Dr.,Who,44-22-55-77-88
35,Bat,Man,317-222-4777
15,Wonder,Woman,999-324-7878
99,Storm,Cloud,367-399-6767
75,Marvel,Girl,222-333-9595
32,Quick,Silver,22-33-77-99
12,Scarlet,Witch,23-35-47-555
$ aws s3 cp sean_numbers.txt s3://sean_spectrum/
upload: ./sean_numbers.txt to s3://sean_spectrum/sean_numbers.txt
$ aws s3 ls s3://sean_spectrum/
2017-05-18 20:28:41 193 sean_numbers.txt
$

Note the names. The external table won’t point at the single file sean_numbers.txt. It will point at the location s3://sean_spectrum/, and all files under that location will be queried. So make sure they have consistent formats!

Also: 30 questions to ask a serverless fanboy

Create & query your external table

Here’s how you create your external table. Note this is just a map to data. The data is still stored in S3. It is not brought into Redshift except to slice, dice & present.


mydb=# create external table spectrum_schema.sean_numbers(
id int,
fname varchar(64),
lname varchar(64),
phone varchar(32))
row format delimited
fields terminated by ','
stored as textfile
location 's3://sean_spectrum/';

Here’s how you query it:


mydb=# select * from spectrum_schema.sean_numbers order by id;
id | fname | lname | phone
----------------+---------------
12 | Scarlet | Witch | 23-35-47-555
15 | Wonder | Woman | 999-324-7878
21 | Dr. | Who | 44-22-55-77-88
32 | Quick | Silver | 22-33-77-99
35 | Bat | Man | 317-222-4777
75 | Marvel | Girl | 222-333-9595
99 | Storm | Cloud | 367-399-6767

Cool. We reordered data read from an S3 file!!!
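
If it helps to picture what the external table is doing, here is a rough Python sketch of the same idea: a column list laid over raw CSV text, with typing and sorting applied only at read time. It is a conceptual model only, not how Spectrum is implemented, and `read_external` is a made-up helper name.

```python
import csv
import io

# The same rows as sean_numbers.txt, inlined so the sketch is self-contained.
RAW = """21,Dr.,Who,44-22-55-77-88
35,Bat,Man,317-222-4777
15,Wonder,Woman,999-324-7878
99,Storm,Cloud,367-399-6767
75,Marvel,Girl,222-333-9595
32,Quick,Silver,22-33-77-99
12,Scarlet,Witch,23-35-47-555
"""

# The "table definition": names applied to the CSV fields, in order.
COLUMNS = ["id", "fname", "lname", "phone"]


def read_external(raw_text):
    """Map each CSV record onto the column names; the file itself is untouched."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = dict(zip(COLUMNS, record))
        row["id"] = int(row["id"])  # id is declared as int in the table
        rows.append(row)
    return rows


# Equivalent in spirit to: select * from ... order by id
rows = sorted(read_external(RAW), key=lambda r: r["id"])
for r in rows:
    print(r["id"], r["fname"], r["lname"], r["phone"])
```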

Although you can’t create a view over a redshift table *AND* an S3 external table, you can query them together.

So for example if I have a table in redshift with addresses, I can join them together:

mydb=# select a.id, a.fname, a.lname, a.phone, b.address from spectrum_schema.sean_numbers a, sean_addresses b
where a.id = b.id order by a.id;

id | fname | lname | phone | address
----------------+----------------------------
12 | Scarlet | Witch | 23-35-47-555 | 10 main st
15 | Wonder | Woman | 999-324-7878 | 25 center st
21 | Dr. | Who | 44-22-55-77-88 | 32 broadway
32 | Quick | Silver | 22-33-77-99 | 1 first st
35 | Bat | Man | 317-222-4777 | 99 west st
75 | Marvel | Girl | 222-333-9595 | 66 East Ave
99 | Storm | Cloud | 367-399-6767 | 50 North st

Also: What can startups learn from the DYN DNS outage?

Will SQL just die already?

With tons of new No-SQL database offerings every day, developers & architects have a lot of options. Cassandra, MongoDB, CouchDB, DynamoDB & Firebase to name a few.

What’s more in the data warehouse space, you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But what if later on you want to ask questions like: which users watched this video, which users are active, which users spent $100 in January? These questions may not be possible because many NoSQL databases can’t join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?
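
As a quick illustration, here is the “which users spent $100 in January” question answered with a join and an aggregate. The sketch uses Python’s built-in sqlite3, and the `users` and `purchases` tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (user_id INTEGER, amount REAL, purchased_on TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO purchases VALUES
      (1, 60.0, '2017-01-05'),
      (1, 50.0, '2017-01-20'),
      (2, 99.0, '2017-01-11'),
      (3, 500.0, '2017-02-02');
    """
)

# Nobody planned for this question when the tables were designed,
# yet a join plus GROUP BY / HAVING answers it directly.
big_spenders = conn.execute(
    """
    SELECT u.name, SUM(p.amount) AS total
    FROM users u
    JOIN purchases p ON p.user_id = u.id
    WHERE p.purchased_on >= '2017-01-01' AND p.purchased_on < '2017-02-01'
    GROUP BY u.name
    HAVING SUM(p.amount) >= 100
    ORDER BY u.name
    """
).fetchall()
print(big_spenders)
```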

Big Query, Redshift & even Hive speak SQL

I wrote that despite recent popularity in Hadoop, Redshift seems to be eating their lunch. And what would you know, surprise surprise, Amazon’s newish data warehousing solution, speaks SQL! What’s more there’s Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

BigQuery is the other major big data offering from none other than Google. And it too uses SQL!

Still dominant

If you look at Stack Overflow’s developer survey, you’ll see that SQL is the second most popular language. Why might that be? For one thing it’s simple to learn. Simple enough that even business users can write basic requests, and join & aggregate data.

Rugged, Proven & Open

Having been around so long, SQL is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.

Business users can write it

Another underappreciated feature though is that basic queries are easy to write. They don’t require complex syntax like a Hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Is Amazon about to disrupt your data warehouse?

Amazon is about to launch a product called Glue. As you can see below, this is the last piece in the data warehousing puzzle. With that in place, Amazon will own you! Or at least have push button products to meet all of an enterprise’s varying needs.

Even if you’re a small startup, you can do big-shot big enterprise data warehousing. That means everyone can use cutting edge data driven techniques for product & business decisions.

What is Redshift

Redshift is like the OLAP databases of years past, the Oracles of the world, purpose built for warehousing data. Obviously without the crazy licensing model Oracle was famous for. With Amazon you can get an enterprise-class data warehouse for modest hourly prices.

If my recent conversations with recruiters about Redshift demand are any indication, there’s been a sudden uptick in startups looking for redshift expertise.

What is Spectrum?

Spectrum is a very new extension of Redshift allowing you to access & query S3 file data directly. This means you can have petabytes of data that you can access pre-load time. So you will ETL and load portions of it, but with Spectrum you can still access the offline data too.

In the old Oracle days this was called an EXTERNAL TABLE. I mention this only to say that Amazon isn’t doing anything that hasn’t been done before. Rather they’re bringing these advanced features within reach of everyday startups. That’s cool.

What is glue?

Glue is still in beta, but if the re:Invent talk above is any indication, it’s set to disrupt an entire industry. Wow!

Glue first catalogs your data sources. What does this mean? It scans them & models their schemas.

It then generates sample python ETL code. Modify it, or write your own. Share your code on Git. Or borrow other open source pieces, that already address your specific ETL use case!

Lastly it includes a job scheduler which handles dependencies. Job A must be completed before B can run and so forth. Error handling & logging are also all included.
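
The dependency handling is easy to picture with a topological sort. The sketch below uses Python’s standard `graphlib` module. It is not Glue’s API, just an illustration of the “A before B” ordering, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
jobs = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load_redshift": {"transform"},
}


def run_order(dependencies):
    """Return an order in which every job runs after all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(jobs)
print(order)
```

A real scheduler adds retries, logging and error handling on top, but the core guarantee is the same: a job never starts before the jobs it depends on have finished.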

Since these are native Amazon services, of course they’re going to integrate with their dangerously fast Redshift warehouse.

What is serverless?

I’ve written about how to throw fastballs at a serverless fanboy and even how to hire a serverless expert. But really what is it?

Serverless means deploying functions directly into the cloud. No servers, no configuration. All the systems administration & automation is hidden. No more devops to argue with! Amazon’s own offering is called Lambda.

What is Quicksight?

Amazon’s even jumped into the fray at the presentation layer. QuickSight is a BI tool along the lines of Mode, Domo, Looker or Tableau.

Now it’s possible to stay completely within the cozy Amazon ecosystem even for business insight and analytics.

Some irresistible reading for March – outages, code, databases, legacy & hiring

I decided this week to write a different type of blog post. Because some of my favorite newsletters are lists of articles on topics of the day.

Here’s what I’m reading right now.

1. On Outages

While everyone is scrambling to figure out why part of the internet went down … wait, is S3 part of the internet, really? While I’m figuring out if it is a service of Amazon, or if Amazon is so big that Amazon *is* the internet now…

Let’s look at s3 architectural flaws in depth.

Meanwhile Gitlab had an outage too in which they *gasp* lost data. Seriously? An outage is one thing, losing data though. Hmmm…

And this article is brilliant on so many levels. Not least because Matthew knows that “post truth” is a trending topic now, and uses it in his title. So here we go, AWS Service status truth in a post truth world. Wow!

And meanwhile the Atlantic tries to track down where exactly are those Amazon datacenters?

Also: Is Amazon too big to fail?

2. On Code

Project wise I’m fiddling around with a few fun things.

Take a look at Jeff Geerling’s Ansible on a Mac playbooks. Nice!

And meanwhile a very nice deep dive on Amazon Lambda serverless best practices.

Brandur Leach explains how to build awesome APIs aka ones that are robust & idempotent

Meanwhile Frans Rosen explains how to 0wn slack. And no you don’t want this. 🙂

Related: 5 surprising features in Amazon’s serverless Lambda offering

3. On Hiring & Talent

Are you a rock star dev or a digital nomad? Take a look at the 12 best international cities to live in for software devs.

And if you’re wondering who’s hiring? Well just about everyone!

Devs are you blogging? You should be.

Looking to learn or teach… check out codementor.

Also: why did dev & ops used to be separate job roles?

4. On Legacy Systems

I loved Drew Bell’s story of stumbling into home ownership, attempting to fix a doorbell, and falling down a familiar rabbit hole. With parallels to legacy software systems… aka any older than oh say five years?

Ian Bogost ruminates why nothing works anymore… and I don’t think an hour goes by where I don’t ask myself the same question!

Also: Are we fast approaching cloud-mageddon?

5. On Databases

If you grew up on the virtual world of the cloud, you may have never touched hardware besides your own laptop. Developing in this world may completely remove us from understanding those pesky underlying physical layers. Yes indeed folks containers do run in “virtual” machines, but those themselves are running on metal, somewhere down the stack.

With that let’s not forget that No, databases are not for containers… but a healthy reminder ain’t bad.

Meanwhile Larry’s mothership is sinking…(hint: Oracle) Does anybody really care? Now’s the time to revisit Mike Wilson’s classic The difference between god and Larry Ellison.

Read: Are SQL Databases Dead?

Why Oracle won’t kill MySQL

1. MySQL does not compete with Oracle

It’s a myth that MySQL somehow poses a threat to Oracle. Oracle’s customers tend to be large enterprises running apps like e-business suite. These are certified to run on Oracle, and further they sit close to finance.

MySQL tends to be the choice of scrappy but nimble startups for their web-facing applications. They want to deploy in the cloud, and don’t want to deal with licenses. Plus they have the techops chops to handle the bushwhacking of open source.

Related: Why I wrote the book on Oracle & Open Source

2. Oracle bought Sun for the hardware business

Remember when Oracle acquired Sun? A lot of folks assumed Larry was after MySQL. Grab it & slowly smother it. But actually it was more frosting on the cake. Larry had for years expressed interest in cubes and clusters, and building an Oracle appliance. Whether this ever came to profitable fruition in the form of Exadata remains to be seen. But buying Sun for a song helped him do this.

Also: Why bemoaning AWS performance sounds like Linux detractors circa 1999

3. Larry blows with the wind on open source

He’s money minded, so you’ll see in his decisions that money comes first.

In the late 90’s when a customer might spend $100k on Sun hardware and $100k on Oracle licenses, Larry realized porting to Linux and pushing commodity hardware would be a win. So he pushed Linux, and customers could now spend $20k on commodity hardware, leaving $180k for Oracle licenses. Imagine a $10 million budget if you’re having trouble with the math here.

He also eventually moved the middle tier to Apache for similar reasons. I would argue Oracle corp overall pays lip service to contributing to open source, but they do that to some degree.

Read: Why MySQL dbas are so hard to find

4. MySQL support business is real

What’s more, just as they did when adopting Linux and then offering their “Unbreakable Linux” distro with pricey support alongside it, they’re doing similar things with MySQL. For enterprise customers, and those already comfortable with making the call to Redwood Shores, sales folks will happily direct them to support contracts and enterprise add-ons. Naturally.

Read: Why your startup needs real techops

5. There are real viable alternatives to keep balance

And let’s not forget folks, there are already a bunch of forks. There’s the popular and ever-growing MariaDB which Google has put their muscle behind.

Of course let’s not forget the very popular, very capable, and very bulletproof Percona distribution, along with the Percona Toolkit and xtrabackup for real hot backups.

And for those looking to experiment, there’s Drizzle, a work in progress and a complete rewrite, and one that’s unfortunately not a drop-in replacement.

Read this: What’s the four letter word dividing dev and ops?

Why I Wrote the Book – Oracle and Open Source

Back in the late 90’s New York City was deep in the dot-com boom. Silicon Alley was being born, and a thousand internet startups were sprouting. Everyone was hiring, it was an exciting time to work in technology!

Trend Spotting Circa 2000

As an independent consultant, I had the opportunity to work at quite a few startups. The technology stack was identical at almost all of them. Sun Microsystems hardware, Apache webservers, and Oracle on the backend. The database was always the sticking point, and developers struggled to get their queries right.

It was an interesting role to hold. Most career DBAs worked at large fortune 500 firms, the old stodgy kind where nothing ever changes. Few of the Oracle old guard, the kind you’d meet at User Groups or conferences, had much exposure to Linux, and they certainly didn’t trust it.

Also: Here’s how to do a scalability performance review

Meanwhile in the startup scene in NYC I was seeing the cutting edge uses of the technology, with more and more shops switching to Linux and commodity hardware. There was even talk of *gasp* Oracle porting to Linux. There was a real rumor mill around all of this.

Oracle and Open Source Published – 2001

Seeing this shift towards commodity hardware, and the tremendous demand for Oracle married with open source technologies, I pitched O’Reilly and Associates with a book idea. Let’s talk about what’s happening in the trenches. How and when does Oracle – the most commercial of relational databases, work with Open Source technologies? What is in the mix? What are real firms using it for? What tools and technologies can help firms grow faster?

Related: Oracle DBA Interview questions for managers, candidates & recruiters alike

These were the questions my co-author and I sought to answer, and to judge from the response I think we did a very good job. As that push continued, Oracle eventually ported its enterprise database to Linux. This was a seismic shift that meant existing Oracle customers would spend a lot less on hardware, and thus have more to spend on Oracle licenses. Win-win except for Sun. The trend continued with Oracle pushing Apache into the mix as well.

Fast Forward a Decade

Now a decade later, Oracle has bought its former partner Sun, and in so doing owns MySQL too.

Read this: Top MySQL Interview questions for Devops, managers & recruiters

What new trends are happening? We hear an incessant drum of hype around cloud computing. In many ways the trend parallels what happened a decade ago. See our related piece a history lesson for cloud detractors. How so?

Commoditization: push towards new platforms, driven by cost.

But this is slowed by an equally large stumbling block.

Performance: new cloud servers can’t compete with their big iron cousins. Not yet at least.

Interested in Amazon EC2? We wrote an Intro to EC2 Cloud Deployments article which digs in deeper.

What’s Next for Datacenters

Commoditization will continue, driving costs downward. This will provide more gravity to cloud migrations for firms big and small.

Performance will improve. Cloud services like Amazon EC2 will get bigger & better, as will the all important network & disk subsystems.

Also: 5 things toxic to scalability

Big enterprises are already dipping their feet in the water with VPC technology, tying their existing datacenter to a cloud. They can grow elastically while still having feet firmly planted on the ground.

As large enterprises begin to get experience behind the wheel, it’ll chip away at the stranglehold of Oracle and the huge taxation type licensing that firms struggle with today. Where salesforce.com had a huge impact, workday.com will be even bigger.

The cloud will finally disrupt the last old guard industry – enterprise software.

Read this far? Get us monthly in your inbox. Grab our scalable startups newsletter!

Best of Guide – Highlights of Our Popular Content

We cherry pick the top 5 most popular posts of various topics we’ve covered in recent months.

Oracle to MySQL – prepare to bushwhack through the open source jungle

I was recently approached by a healthcare company for advice on suitable database solutions capable of executing its new initiative. The company was primarily an Oracle shop so naturally, they began by shopping for possible Oracle solutions.

The CTO relayed his conversation with the Oracle sales rep, who at first recommended an Oracle solution that, expensive as it may have been, ultimately aligned with the company’s existing technology and experience. Unfortunately this didn’t match their budget and so predictably, the Oracle sales rep whipped out a MySQL-based solution as an alternative.

Having worked as an Oracle DBA throughout the dot-com years, I know the technology well. I also know the cultural differences between enterprises that choose Oracle solutions and those that choose open-source ones.

This encounter with the healthcare firm struck me as a classic conundrum for today’s companies who are under pressure to meet business targets under a tight budget, and in a very short time.

Can an open-source solution like MySQL be the answer to such huge demands?

The Oracle sales rep will likely nod excitedly and say no sweat. But as a consultant I could only manage an equivocal yes.

As the healthcare CTO rattled off the list of products he wanted to use, along with specific RTOs and RPOs (recovery time objective + recovery point objective), all I could do was react with concern.

In my experience with startup after startup I’ve seen plenty of different MySQL installations but I’d never heard of one with the technology stack he described. What’s more I’d never heard of these solutions described with the Oracle Corp titles.

On one hand I wanted to discuss the merits of the solution he was keen to implement, while on the other, I was expressing concern over possible directions and paths we might take.

An Oracle cluster is not a MySQL cluster

The solution Oracle suggested was a MySQL Cluster. The term cluster unfortunately means different things to different people. Such loose usage of the word dilutes its meaning. In particular a lot of Oracle technologists expect that this solution might be similar to Oracle’s Real Application Cluster technology. It’s not. There are a lot of limitations, and frankly it’s really just a different beast.

The list also included various management dashboards which Oracle likes to push, but which I rarely see in my consulting assignments. What’s more I heard nothing about replication integrity considering that replication problems are an ongoing concern for real-world MySQL installations due to the particular technology used under the hood. There are reliable solutions to this problem but none yet available from Oracle. In fact, this is a big problem but one that may be completely off the sales guys’ radar.

Don’t let sales frame your architecture

Honestly, I don’t have a particularly large axe to grind with the sales guys. They have a job to do, and providing solutions which bring revenue to their firm and commissions for themselves is what puts food on their tables. Each party is motivated in different ways. But as a company shopping for solutions, this should be kept clearly in mind when starting down that road.

Beware prescribed architectural frameworks that appear too easy, because they almost never do what they say on the tin. Unfortunately sales folks don’t have experience designing architectures in the real world, so they can’t really know how the technologies work beyond the data sheet with feature bullet points.

As we all know in the technology space, all software comes with bugs and real-world experience does not match the feature lists in the brochures. In law they have de jure and de facto. The former describes what is written and the latter, what’s practiced. For technology solutions, it’s never a matter of just adding water for something to work.

Do your homework

Before you embark on a new trip through the open source technology jungle, do some due diligence. Read up on real-world solutions, and how other large firms are using the technology. What configurations are they having success with? Which are causing trouble for a lot of people?

One of the great advantages of open source is the very vibrant communities, forums and discussion groups where people are glad to share their experiences and offer advice.

Allow sufficient time to test and bring your team up to speed

This is a very important one. Shifting from an enterprise that relies primarily on Oracle for its relational database solution over to one that relies on open source technologies is a very big step indeed. Open-source technologies tend to be much more do-it-yourself and roll your own. Oracle solutions tend much more toward predefined paths and solutions and prescriptions for customers.

There are merits to each of these paths, with attendant pros and cons. But they are decidedly different. It’s likely that your team will also require time to get up to speed, not just with the particular software components, but with the new process by which things happen in the open-source space. Allow sufficient time for this shift to take place, lest you create more problems than solutions.

Extract Transform & Load – What is it and why is it important?

So-called ETL relates to moving data from external sources into and out of relational databases or data warehouses.

Extract

Source systems may store data in an almost infinite variety of formats. Extracting involves getting that data into common files for moving to the destination system. A CSV (comma separated values) file is so named because each record is stored as one line in the file, with fields separated by commas and often surrounded by quotes as well. In MySQL the SELECT ... INTO OUTFILE syntax can perform this function. If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and create a .mysql script to then run with the mysql shell. In Oracle you would use the spool command in SQL*Plus, the command line shell. Spool sends subsequent screen output to a file as well.
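
As a rough sketch of the extract step, here is a table dumped to quoted CSV in Python. The built-in sqlite3 module stands in for the source database, and the `people` table and `extract_to_csv` helper are invented for the example.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Ann, Jr.", "F"), (2, "Bob", "M")],
)


def extract_to_csv(connection, table, out):
    """Write a header line plus every row of `table` to `out` as quoted CSV.

    `table` is interpolated into the SQL, so it must come from a trusted
    source such as the data dictionary, never from user input.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY id")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


buf = io.StringIO()
extract_to_csv(conn, "people", buf)
csv_text = buf.getvalue()
print(csv_text)
```

Quoting every field is what keeps a value like “Ann, Jr.” from being split in two by its embedded comma.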

Transform

This step involves modifying the extracted data in preparation for moving it into the target database server.  It may involve sweeping out blank records, or rearranging columns, or breaking files into smaller subsets of data.  You might also map values differently for instance if one column in the source database was gender with values M/F you might transform those to the strings “Male” and “Female” if that is more useful for your target database server.  Or you might transform those to numerical values, for instance Male & Female might be 0/1 in your target database.
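
Those two transformations, sweeping out blank records and remapping gender codes, can be sketched in a few lines of Python. The three-column layout (id, name, gender) and the `transform` helper are assumptions for the example.

```python
import csv
import io

# Mapping from source codes to the strings the target database wants.
GENDER_LABELS = {"M": "Male", "F": "Female"}


def transform(rows):
    """Drop blank records and expand gender codes to full strings."""
    cleaned = []
    for row in rows:
        if not any(field.strip() for field in row):
            continue  # sweep out blank records
        rec_id, name, gender = row
        # Unknown codes pass through unchanged rather than being dropped.
        cleaned.append([rec_id, name, GENDER_LABELS.get(gender, gender)])
    return cleaned


source = io.StringIO("1,Ann,F\n,,\n2,Bob,M\n")
result = transform(csv.reader(source))
print(result)
```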

Although a myriad of high level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experienced System Administrator is familiar with. Those include grep & sed, which operate on regular expressions and can perform data transformation at lightning speed. Then there is sort, which can sort data and send the results to stdout or the file of your choosing. Other tools include wc (word count), cut, which can remove columns, and so forth.

Load

This final step involves moving the data into the database server, and its final target tables. For instance in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat file data loader.
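
For a feel of what the load step does, here is a minimal Python sketch that bulk-inserts CSV rows into a table. Again sqlite3 stands in for the target database, and the `load_csv` helper and `people` table are hypothetical. LOAD DATA INFILE and SQL*Loader do the same job far faster on real data volumes.

```python
import csv
import io
import sqlite3


def load_csv(connection, table, csv_file):
    """Bulk-insert CSV rows into `table`; returns the number of rows loaded.

    `table` is interpolated into the SQL, so it must be a trusted name.
    """
    rows = list(csv.reader(csv_file))
    if not rows:
        return 0
    placeholders = ", ".join("?" for _ in rows[0])
    connection.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, gender TEXT)")
loaded = load_csv(conn, "people", io.StringIO("1,Ann,Female\n2,Bob,Male\n"))
print(loaded, conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])
```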

Quora discussion by Sean Hull – What is ETL?