Hadoop requires metadata repositories (relational databases) for Ambari (management), the Hive metastore (SQL), Oozie (scheduler and workflow tool) and Hue (Hadoop UI). Choices include Postgres, MySQL, Oracle or Derby. The databases holding the Hadoop metadata repositories have to be backed up and maintained like any other database server.
I recommend using MySQL for the following reasons:
- Oracle is too heavyweight a database server for this job; its full resources will not be utilized. The Oracle database server will take extra memory, disk space and CPU that the metadata repositories will never take advantage of.
- Postgres is a good, solid database, but it has not reached a tipping point in adoption. I do not see many Postgres databases at customer sites, and I do not see Postgres gaining ground in the market.
- Derby (used with Oozie) and SQLite (used with Hue) are not robust enough for a heavy production environment. I would only use these databases if I were creating a small Hadoop cluster for personal development.
- MySQL is extremely fast and lightweight.
- MySQL is relatively easy to administer and back up.
- MySQL replication is very easy to set up and maintain.
- MySQL has extremely high adoption and it is easy to find resources to manage it.
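
To illustrate the "easy to back up" point, here is a minimal sketch of a nightly dump of the metadata databases using the standard mysqldump client. The database names (ambari, hive, oozie, hue) and the backup directory are assumptions for illustration, so substitute whatever names were chosen at install time. The sketch also assumes credentials come from a MySQL option file such as ~/.my.cnf rather than the command line.

```python
import datetime
import subprocess
from pathlib import Path

# Assumed schema names; replace with the names used in your cluster.
METADATA_DATABASES = ["ambari", "hive", "oozie", "hue"]


def build_dump_command(database, backup_dir, day):
    """Return the mysqldump argument list for one metadata database."""
    out_file = Path(backup_dir) / f"{database}-{day.isoformat()}.sql"
    return [
        "mysqldump",
        # Take a consistent snapshot of InnoDB tables without locking them.
        "--single-transaction",
        f"--result-file={out_file}",
        database,
    ]


def backup_all(backup_dir, day=None, run=subprocess.run):
    """Dump every metadata database into backup_dir, one file per database."""
    day = day or datetime.date.today()
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    for database in METADATA_DATABASES:
        # check=True raises CalledProcessError if mysqldump exits non-zero.
        run(build_dump_command(database, backup_dir, day), check=True)
```

Calling backup_all with a directory of your choice from a scheduled job would produce one dated .sql file per repository. The run parameter exists only so the commands can be inspected or tested without a live MySQL server.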
- Main documents page: http://docs.hortonworks.com/
- Reference Guides - Supported Database Matrix: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-184.108.40.206/bk_reference/content/db-support-matrix.html
- Oracle Steps for Ambari: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-220.127.116.11/bk_using_Ambari_book/content/ambari-chaplast-3.html
- Oracle Steps for Hive: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-18.104.22.168/bk_using_Ambari_book/content/ambari-chaplast-1.html
- Oracle for Oozie: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-22.214.171.124/bk_using_Ambari_book/content/ambari-chaplast-2.html