ETL – Data Wharehouse

Above is the common elements that make up a Data Warehouse. The data warehouse consists of several different elements, the source can come from legacy systems that are usually an operational system used by the corporation or external data sources, the data staging area where it is processed and moved to the presentation server where the data is organized and stored for future queries and reports. The last step is the end user data access point, currently one of the most popular forms to access data is through a web page.

The first step is extraction and involves obtaining the data from the source systems.

Next it is transforming the data using a series of steps to make the data usable for the data warehouse. The steps consists of cleaning the data to improve data quality and consistency, purging of any unnecessary data that is not required by the data warehouse, combining of similar data from different source systems and creating surrogate keys.

After the data is extracted, transformed and cleansed, it is ready to be loaded and indexed into the warehouse for fast querying.

Other steps include running a series of reports to ensure that the data stays consistent and that the quality stays intact. The data is released to users for generating reports and dashboards and is secured against unauthorized user access.

Main steps in the ETL cycle are:

  • Initiating the Cycle
  • Building of reference data
  • Extraction
  • Validation
  • Transforming
  • Staging (if staging tables are necessary)
  • Audit reports to verify that all business rules are kept
  • Publishing to warehouse tables
  • Archiving
  • Clean up.

rsnapshot: Auto Version Controlled Backup for Unix/Linux/Mac/BSD…

Red Had Enterprise Linux (RHEL) 6. I missed the ease of configuration and all the free tools that people smarter than me have created.

Systems that can take advantage of Rsnapshot

Systems that can take advantage of Rsnapshot

I would like to do a fast post on rsnapshot. I have seen ssh and rsnapshot scheduled in cron to automate backups of OSX to a Linux server. Since we didn’t want the wireless to slow down we only used the physical MAC address of the MAC. What makes rsnapshot so great is that it will wok on so many systems that are out there (Ubuntu, Debian GNU/Linux, Red Hat Linux, Fedora Linux, SuSE Linux, Gentoo Linux, Slackware Linux, FreeBSD, OpenBSD, NetBSD, Solaris, Mac OS X, and even IRIX) .

For now I’m using it for personal automated backups to my external hard drive. There are plenty of other advanced options and examples on the Internet. I just wanted to get out a fast an easy example.

  1. First – find and install rsnapshot. for Red Hat this was
    $ sudo yum install rsnapshot
    (rsynch is a dependancy that should already be installed).
  2. After install if you do not have this file /etc/rsnapshot.conf. Use the command:
    $ sudo cp /etc/rsnapshot.conf.default /etc/rsnapshot.conf
  3. Edit rsnapshot.conf – The defaults I changed from the default configuration file are below. These options allow me to back up everything in /etc/ and /home/. Backups kept will be twice a day, 7 days a week, 4 weeks, 12 months and 5 years (change this as you see fit).  Most important is that switch to make sure that the mount point will not be created and wrote to locally if the disk is not attached.
    1. WHERE TO PLACE BACKUPS
      # All snapshots will be stored under this root directory.
      #
      snapshot_root   /media/myexternal/rsnapshot/
    2. DO NOT CREATE IF DISK IS NOT CREATED
      # If no_create_root is enabled, rsnapshot will not automatically create the
      # snapshot_root directory. This is particularly useful if you are backing
      # up to removable media, such as a FireWire or USB drive.
      #
      no_create_root 1
    3. INTERVALS (make sure this is tabbed – do NOT use spaces)
      #########################################
      #           BACKUP INTERVALS            #
      # Must be unique and in ascending order #
      # i.e. hourly, daily, weekly, etc.      #
      #########################################
      interval        hourly  12
      interval        daily   7
      interval        weekly  4
      interval        monthly 12
      interval        yearly  5
  4. Time to configure cron. Most people will tell you to create your jobs using $ crontab e
    I prefer to use the root crontab using $ sudo vim /etc/crontab shown below:

    0 */12 * * * root /usr/bin/rsnapshot hourly # Every 12 hours
    30 23 * * * toot /usr/bin/rsnapshot daily   # Daily at 11:30PM
    20 2 * * 0 root /usr/bin/rsnapshot weekly   # Sunday at 2:20AM
    10 5 1 * * root /usr/bin/rsnapshot monthly  # First day of the month at 5:10AM
    01 8 1 1 * root /usr/bin/rsnapshot yearly   # January 1st at 8:01AM
  5. Test It – Following these steps you should have the basic setup needed to run rsnapshot on your personal computer to an external hard drive or usb. Just one last thing to do. Make sure that your hard drive is plugged in and  run:
    $ sudo rsnapshot -V hourly
    rsnapshot should give you plenty of verbose information as it creates your first hourly backup inside the location you specified. If there is a issue with the lock file, remove the lock file and try again.

Still stuck?

There are many other helpful documents out there  start with the rsnapshot how to:
http://www.rsnapshot.org/howto/1.2/rsnapshot-HOWTO.en.html#installation

If you want to learn how to do remote backup and use OSX? try this article:
http://blog.philippmetzler.com/?p=138

As I said in the beginning of this article, this was a fastpost and not meant to cove everything about rsnapshot. It took longer to write this article than it did to set up rsnapshot.
Good Luck – Adam M. Erickson

 

[/fusion_builder_column][/fusion_builder_row][/fusion_builder_container]

Elements Used in a Data Warehouse

Many Business Intelligence solutions are based on the use of a data warehouse. Here is a view of the components of a data warehouse both logically and physically.  Data Warehouse Elements

The data warehouse consists of several different elements:

  • The source can come from legacy systems that are usually an operational system used by the corporation or external data sources,
  • The data staging area is where data is processed (normalized and some history is stored) and moved to the presentation server.
  • The presentation server takes the data, organizes it and stores is for future quires and reports
  • The last step is the end user data access point, currently one of the most popular forms to access data is through a web page and mobile applications.

You can download the visio drawing I created here –>data-warehouse-elements.vsd