Saturday, February 25, 2012

Talend Open Studio: How to set up context variables

A common way to make your ETL processes easy to configure is to use global variables. This caters for scenarios where you have to move your ETL processes from development to testing and then to production. The idea is that by changing a few variable values in a central place, your ETL process will run on another environment: what a time saver!

In this tutorial we will take a look at how to achieve this with Talend Open Studio for Data Integration.


Common global variables

First let’s have a look at the settings that we would like to use a variable for:

Database details:
source_database
source_username
source_password
source_port
source_schema
source_server

target_database
target_username
target_password
target_port
target_schema
target_server

reject_database
reject_username
reject_password
reject_port
reject_schema
reject_server

logging_database
logging_username
logging_password
logging_port
logging_schema
logging_server

ETL job execution:
date_start
date_end
temp_data_dir
source_dir

The above are just examples; there are many more use cases.


How to create global variables

Talend has the concept of context variables. You can access them in the settings fields by pressing CTRL+SPACE.

In Talend Open Studio you can create several context groups which hold various variables.
Think of context groups as a bucket for related variables. For each context group you can define various contexts like production, development, test etc. This is extremely useful when moving your ETL jobs from one environment to another.

Note: You can create as many context groups as you want. Currently, every time you create a context group, you have to define the contexts as well. I added a new feature request which will allow you to define the contexts once in the project settings so that each time you create a new context group these contexts are assigned by default. This should help to keep the contexts more consistent and manageable across multiple context groups.

To create new variables, right click on Contexts in the repository and choose Create context group:

First create the variables by pressing the + button and assign each one a type:
Then click on the Value as tree tab and expand your variable definition. Note that the default context will be called default. To change this name and/or to add other contexts, click on the context icon in the top right hand corner:
The Configure Contexts dialog allows you to edit existing contexts or to add new ones. Once you have defined your contexts, click OK.

Now you will see your new/altered context show up in the main dialog. Define if you want a prompt and prompt text for each variable/context combination. Finally define a value:
When you are done, click on Finish.

Specifying the variables this way allows you to use them across multiple jobs. Think of it as an approach to easily manage your variables across all your jobs. You can create variables for each job as well, but those will be local to that job only (and hence not available to other jobs).


How to use repository context variables within jobs

Once you have the context variables defined in the repository, you can easily add them to your job:
  1. Open the job and click on the Context tab. Then click on the Repository icon:
  2. Select the variable you want to add and click OK.
  3. These variables will now show up in the context tab. Note that the variable will be available with the context prefix:
You can now use the variables in the component settings by pressing CTRL+SPACE:


Here are two examples of how to use the variables in a query:

"SELECT * FROM raw_data WHERE date>= DATE '"+context.date_start+"' AND date<= DATE '"+context.date_end+"'"

"SELECT * FROM raw_data WHERE date>= TO_DATE('"+context.date_start+"','yyyyMMdd') AND date<= TO_DATE('"+context.date_end+"','yyyyMMdd')"

And here you can see an example using a context variable to define part of the file path:
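A sketch of such an expression, assuming the source_dir and date_start variables from the list above are both defined as strings, as it could be entered in the File name field of a file input component:

context.source_dir + "/raw_data_" + context.date_start + ".csv"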


How to define the context on execution

While designing your job, you can choose the context from the Run tab:
When you export your job, you will also have an option to specify the context.


How to load context variables dynamically

You can use the tContextLoad component to load the variables dynamically, for example from a file, when you run the jobs in different environments (see the forum question for details).
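A minimal sketch of such a setup: a flat file holds one key=value pair per line, a file input component reads it with a two-column schema (key and value, "=" as the field separator), and its output flow is connected to tContextLoad, which overwrites the matching context variables at runtime. The variable names are the ones defined earlier; the file name (for example context_production.txt) and the values are placeholders:

source_server=prod-db.example.com
source_port=5432
source_username=etl_prod
target_schema=dwh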


Easily set up context variables for your connection

When setting up connections in the metadata repository, you can easily auto-generate context variables for the settings. To do so, press the Export as context button:
In the next dialog you can give it a name, then click Next and you will have the option to alter the auto-generated variable list:
Now add a new query to the Metadata Repository using the SQL Builder. Make sure to tick context mode. Go to the Designer tab, right click on the text area and choose Add tables. Use the visual tool to build your query. Then switch back to the Edit tab and you will see that the SQL Builder made use of the context variables.



I hope that this short introduction to context variables will help you to make your data integration jobs easier to configure!

Monday, February 20, 2012

Talend Open Studio: Populating a date dimension

Date dimensions are an essential part of a data warehouse. Usually they are only populated once. Scripts can be created on the database side (as outlined here), but if you are working on various projects involving a variety of databases, it is more efficient to create just one ETL job which can be used to populate any database.

In this tutorial we will have a look at creating such an ETL job with Talend Open Studio for Data Integration. We will create a basic date dimension which you can then extend even further. In order to follow this tutorial, the reader should be familiar with the basic functionality of Talend Open Studio.

Our date dimension will look like this one (partial screenshot):


The primary key of the dimension will be an integer representation of the date, which saves us the hassle of looking up the key when we transform the fact data.

Open Talend Open Studio for Data Integration and create a new job called populate_date_dimension. First we will define a context variable for the start date, because we will use this job in various projects and we might require a different start date each time:

Click on the Context tab and then on the + button to add a new context variable. Give it the name myStartDate of type Date and define a value for it.


Next add a tRowGenerator component to the design area and double click on it to open the settings dialog. The idea is to generate a given number of rows: the first row will hold our start date and each subsequent row will increment the date by one day.


  1. Click the + button to add a new column. Name it date and set the type to Date.
  2. Click in the Environment variables cell on the right hand side and then you will see the parameters displayed in the Function parameters tab on the bottom left hand side.
  3. Define the number of rows that should be generated in Number of Rows for RowGenerator.
  4. In the Function parameters tab set the date parameter value to context.myStartDate. This will ensure that the context variable which we defined earlier will be used.
  5. Set the nb parameter to Numeric.sequence("s1", 1, 1) - 1. Use the expression builder for a more convenient setup. This will create a sequence which we will use to add days to our start date. The reason why we subtract 1 at the end is that we want to keep our start date (see the sketch after this list).
  6. Set the dateType parameter value to “dd”. This ensures that days will be added to our date.
  7. Click on the Preview tab and check if the result set looks as expected.
  8. Click Ok to close the component settings.
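Putting steps 4 to 6 together, the generated date column is effectively the result of the addDate function; written out as a plain Talend (Java) expression it looks roughly like this (a sketch using the s1 sequence from above):

TalendDate.addDate(context.myStartDate, Numeric.sequence("s1", 1, 1) - 1, "dd")

So the first row gets the start date itself (1 - 1 = 0 days added), the second row the following day, and so on.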


Now add a tMap component and create a row from the tRowGenerator to the tMap component. Double click the tMap component:




  1. Click the + button on the right hand side to create a new output table.
  2. Add new columns to the output table and for each of them define a specific date format using this approach: Integer.parseInt(TalendDate.formatDate("yyyyMMdd",row1.date)) for integer values and TalendDate.formatDate("MM",row1.date) for string values. Have a look at the Java SimpleDateFormat specs to get an understanding of all the formatting options. You will now spend some time setting up the various date formats (see the sketch after this list).
  3. Java SimpleDateFormat doesn’t provide a quarter format, hence we have to create our own in the form of a ceiled division (covered quotient): (Integer.parseInt(TalendDate.formatDate("M",row1.date))+3-1) / 3
  4. Click Ok.
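As an illustration, a handful of typical output column expressions could look like the following (the column names are just examples):

date_key:     Integer.parseInt(TalendDate.formatDate("yyyyMMdd", row1.date))
year:         Integer.parseInt(TalendDate.formatDate("yyyy", row1.date))
month:        Integer.parseInt(TalendDate.formatDate("MM", row1.date))
month_name:   TalendDate.formatDate("MMMM", row1.date)
day_of_month: Integer.parseInt(TalendDate.formatDate("dd", row1.date))
quarter:      (Integer.parseInt(TalendDate.formatDate("M", row1.date)) + 3 - 1) / 3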


Add a database output component of your choice (in my case I used one for PostgreSQL) and create a row from the tMap to the database output component. Double click the database output component and provide all the necessary settings. That’s it: Now you can run the job and examine the data in your table.

Saturday, February 18, 2012

Talend Open Studio: Scheduling and command line execution

In this tutorial we will take a look at how to export a Talend Open Studio ETL job to an autonomous folder and schedule the job via crontab. In order to follow this tutorial, the reader should be familiar with the basic functionality of Talend Open Studio for Data Integration.


How to export a job


Right click on your job and choose Export job.


In the export settings define:
  • the export folder and file name
  • the Job Version
  • set the Export type to Autonomous Job
  • tick Export dependencies
  • define the Context and tick Apply to children
Click on Finish and your job will be exported.


How to execute the job from the command line


Navigate to the folder where the zip file was exported to and unzip it. Then navigate to:


<jobname>_<version>/<jobname>

Within this folder you will find an executable shell and/or batch file:


Open this file in a text editor:


Note that the context is defined as a command line argument. It is currently set to the value which you specified on export, but you can change it to another value here at any time.

To execute the job on the command line simply navigate to this folder and run:
sh ./<jobname>_run.sh



How to execute a job with specific context variables

As you might have guessed, the approach is very similar to the one shown above; we just add command line arguments:

sh ./<jobname>_run.sh --context_param variable1=value1 --context_param variable2=value2



How to change the default context variables

If you ever need to change the value of any of your context variables, you can find the property file for each context in:

<jobname>_<version>/<jobname>/<projectname>/<jobname>_<version>/contexts/

Open one of them to understand how they are structured:
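They are plain Java properties files, one per context, with one key=value line per variable. A sketch of what a Default.properties file (the file name depends on how you named your contexts) might contain, using the variables defined earlier with example values:

source_server=localhost
source_port=5432
source_username=etl_user
date_start=20120101
date_end=20120131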

As you can see it is extremely easy to change these values.


How to schedule a job

If you make use of context variables regularly, then it is best to include them directly in the *_run.sh or *_run.bat file. Just open the file with your favourite text editor and add the variables after the context argument, similar to the example below:
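For example, the command at the end of the *_run.sh file could be extended like this (the leading ... stands for the generated command that is already in the file; the context and variable values are placeholders):

... --context=Production --context_param date_start=20120101 --context_param date_end=20120131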
Ideally though, especially if you are dealing with dates, you want to make this more dynamic, like the example below:
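A sketch for the *_run.sh variant, letting the shell compute the dates at runtime with the date command:

... --context=Production --context_param date_start=$(date --date='yesterday' +%Y%m%d) --context_param date_end=$(date +%Y%m%d)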
On Linux use Crontab to schedule a job:

crontab -e

And then set it up similar to the one shown below:
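A sample crontab entry that runs the job every night at 2am and appends the output to a log file could look like this (the paths are placeholders):

0 2 * * * cd /opt/etl/<jobname>_<version>/<jobname> && sh ./<jobname>_run.sh >> /var/log/etl/<jobname>.log 2>&1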

On Windows you can use the Windows Scheduler. As it has a GUI, it is quite straightforward to set up and hence will not be explained here.

Friday, February 17, 2012

Talend: Setting up database logging for a project

When executing Talend ETL jobs it is quite important to store the logging information. This is relevant not only for the last execution of an ETL job; keeping a longer logging history can be quite an advantage. This logging information can be stored in flat files or in a database. We will have a look at the latter option here. I will not go into too much detail, but provide a quick overview. I expect that you are familiar with the basic functionality of Talend Open Studio for Data Integration.
  1. Open Talend and create a new job. Drop the tLogCatcher, tStatCatcher, tMeterCatcher components on the design area:
  2. Click on tLogCatcher. In the component settings click on Edit Schema. In the schema dialog click on the save button. Give the schema a name. Do the same for tStatCatcher and tMeterCatcher. The schemas will then show up in the repository under Generic schemas:
  3. As we don’t need these components any more, deactivate all of them by right clicking on them and choosing deactivate.
  4. Create a new connection to your database of choice in the repository. This will be the database where all the logging data will be stored.
  5. Next we will create the logging tables: Add three tCreateTable components to the design area and link each of them with an onSubjobOk row:
  6. In the component settings, assign for each of them the repository database connection.
  7. In the component settings, assign one of the generic repository schemas to each of the three components and define a table name:
  8. Run the job. All the three logging tables should now exist in your database.
  9. Let’s add these tables to the repository database connection we defined earlier: Right click on the connection and choose Retrieve schema. Choose the three logging tables and click Ok.
  10. Now we can assign these repository schemas/table definitions to the project settings. In the main menu click on File > Edit project properties. Click on Stats & Logs and then tick On Databases. Assign the repository database connection and assign the respective repository schemas to the log tables. Finally tick Catch components statistics.

Now logging is set up for your project.


In the future you can run some simple SQL statements to retrieve information about the performance of your ETL jobs.
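For example, assuming you named the stats table talend_stats and kept the columns from the generic schema (job, moment, duration), a query along these lines would show the average runtime per job:

SELECT
job
, COUNT(*) AS executions
, AVG(duration) AS avg_duration
, MAX(moment) AS last_run
FROM
talend_stats
GROUP BY
job
ORDER BY
avg_duration DESC
;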

Sunday, February 12, 2012

PostgreSQL: Auto generating a sample dataset

Sometimes you just want to create simple sample datasets for tables quickly. If you are lucky your database provides some native SQL functions for this purpose.

PostgreSQL has quite an interesting recursive WITH statement which allows you to create a loop: you can, for example, define a value that is incremented with each iteration until a certain condition is met. This comes in quite handy when creating sample data for, say, a date dimension:

Let's keep it very simple:


WITH RECURSIVE date_generator(date) AS (
   VALUES (DATE '2012-01-01')
 UNION ALL
   SELECT date+1 FROM date_generator WHERE date < DATE '2012-01-15'
)
SELECT
date
, EXTRACT(DAY FROM date) AS day
, EXTRACT(MONTH FROM date) AS month
, EXTRACT(QUARTER FROM date) AS quarter
, EXTRACT(YEAR FROM date) AS year
FROM
date_generator
;


In the WITH statement we provide a start date (in this case 2012-01-01) and increase it by 1 until a specific end date (in this case '2012-01-15') is reached. In the main query we make use of the auto generated dates by extracting various date periods. The output looks like this:

Now you can easily change the above query to insert the auto-generated data into a table. This is a very elegant solution as everything can be set up using standard SQL. Have a look at the official PostgreSQL documentation for more information.
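For instance, assuming a dim_date table with matching columns already exists, the query could be rewritten as an INSERT like this:

WITH RECURSIVE date_generator(date) AS (
   VALUES (DATE '2012-01-01')
 UNION ALL
   SELECT date+1 FROM date_generator WHERE date < DATE '2012-01-15'
)
INSERT INTO dim_date (date, day, month, quarter, year)
SELECT
date
, EXTRACT(DAY FROM date)
, EXTRACT(MONTH FROM date)
, EXTRACT(QUARTER FROM date)
, EXTRACT(YEAR FROM date)
FROM
date_generator
;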