Thursday 11 December 2014

Change Data Capture in Informatica

Change Data Capture (CDC) is well suited to small data integration projects that involve relatively few workflows.

We will be using Informatica Mapping Variables to achieve our Change Data Capture logic.

Informatica Mapping Variable:

An Informatica mapping variable is created in PowerCenter Designer. It can be used in any expression in a mapping, as well as in a source qualifier filter, user-defined join, or extract override, and in the Expression Editor of reusable transformations.

A mapping variable can take its start value from:
  • Value in the parameter file
  • Pre-session variable assignment
  • Value saved in the repository
  • Initial value
  • Default value
The Integration Service looks for the start value in the order listed above. The value of the mapping variable can be changed within the session using an expression, and the final value is saved to the repository at the end of the run. The saved value is retrieved in the next session run and used as the start value.
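For example, if the start value is supplied through a parameter file, a minimal entry might look like the sketch below. The folder, workflow, and session names are illustrative assumptions, the variable $$M_END_DATE is the one we create later in this post, and the date literal must match the variable's declared format:

        [ETL_FOLDER.WF:wf_cust_load.ST:s_m_cust_cdc]
        $$M_END_DATE=12/01/2014 00:00:00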

How to Set Mapping Variable Value?

You can set or change the mapping variable value within the mapping or session using a set function. The set function to use depends on the aggregation type of the variable, which is specified when the variable is declared in the mapping. A short usage sketch follows the list below.
  • SetMaxVariable: Sets the variable to the maximum value of a group of values. To use this variable, the aggregation type of the mapping variable must be set to Max.
  • SetMinVariable: Sets the variable to the minimum value of a group of values. To use this variable, the aggregation type of the mapping variable must be set to Min.
  • SetCountVariable: Increments the variable value by one. In other words, it adds one to the variable value when a row is marked for insertion, and subtracts one when the row is marked for deletion. To use this variable, the aggregation type of the mapping variable must be set to Count.
  • SetVariable: Sets the variable to the configured value. At the end of a session, it compares the final current value of the variable to the start value of the variable. Based on the aggregate type of the variable, it saves a final value to the repository.
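As a quick illustration (a sketch only; $$NEW_CUST_COUNT is a hypothetical Count-type variable), variable ports in an Expression transformation might call the functions like this:

        SETMAXVARIABLE($$M_END_DATE, LATEST_DT)
        SETCOUNTVARIABLE($$NEW_CUST_COUNT)

Each function only takes effect when the aggregation type of the variable matches, as described above.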

Change Data Capture Implementation:

Now, let’s start building our mapping with Change Data Capture.
Here we are going to implement Change Data Capture for a CUSTOMER data load. We need to load any new or changed customer data to a flat file. Since the LATEST_DT column value changes for any new or updated customer record, we will be able to find the new or changed customer records using the LATEST_DT column.

First, create the mapping variable while building the mapping, as shown below.

        $$M_END_DATE as Date/Time



Now bring the source and source qualifier into the Mapping Designer workspace. Open the source qualifier and add the filter condition below to pull only the latest data from the source.

STG_CUST.LATEST_DT > CONVERT(DATETIME,'$$M_END_DATE')



Note: The latest value for the variable $$M_END_DATE is retrieved from the repository every time the session runs.

Now map the LATEST_DT column to an Expression transformation and create a variable port with the expression below.

        SETMAXVARIABLE($$M_END_DATE, LATEST_DT)



Note: This expression ensures that the latest value of the LATEST_DT column is saved to the repository after the session completes successfully.

Now map the remaining columns to the downstream transformations and complete any other transformations required in the mapping.
Informatica Mapping

That is all you need to configure Change Data Capture. Now create your workflow and run it.



If you look at the session log, you can see that the mapping variable value is retrieved from the repository and used in the source SQL.
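For illustration, the query written to the log might look something like this (the column list, timestamp, and CONVERT syntax are assumptions that depend on your source definition, your database, and the value persisted from the previous run):

        SELECT STG_CUST.CUST_ID, STG_CUST.CUST_NAME, STG_CUST.LATEST_DT
        FROM STG_CUST
        WHERE STG_CUST.LATEST_DT > CONVERT(DATETIME,'12/01/2014 10:15:00')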

You can view the mapping variable value stored in the repository from Workflow Manager: select the session in the workspace, right-click, and choose 'View Persistent Value'.



Monday 8 December 2014

Datastage Interview Questions and Answers

Q. What is DataStage?
  • Design jobs for Extraction, Transformation and Loading (ETL).
  • Ideal tool for data integration projects such as data warehouses, data marts and system migrations.
  • Import, export, create and manage metadata for use within jobs.
  • Schedule, run and monitor jobs, all within DataStage.
  • Administer your DataStage development and execution environments.

Q. What are the components of Ascential DataStage?
  • Client Components - Administrator, Director, Manager, and Designer.
  • Server Components - Repository, Server and Plug-ins.

Q. Which command is required to check the DataStage job status?
  • dsjob -status

Q. What is version Control?
  • Version Control stores different versions of DataStage jobs, runs different versions of the same job, reverts to a previous version of a job, and lets you view version histories.

Q. What is the difference between DataStage and Informatica?
  • DataStage supports parallel processing, which Informatica doesn't.
  • Links are objects in DataStage; in Informatica it is port-to-port connectivity.
  • In Informatica it is easy to implement Slowly Changing Dimensions, which is a little more complex in DataStage.
  • DataStage doesn't support complete error handling.

Q. What is a merge?
  • Merge is a stage that is available in both parallel and server jobs.
  • The merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel).
  • Merge requires the master table/dataset and the update table/dataset to be sorted.
  • Merge is performed on a key field, and the key field is mandatory in the master and update dataset/table.

Q. What are the parts/components of a configuration file? (A sample file is shown after the list.)
  • Node
  • ServerName
  • Pools
  • FastName
  • ResourceDisk
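For reference, a minimal single-node configuration file might look like this (the host name and resource paths are placeholders):

        {
          node "node1"
          {
            fastname "etl_server"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
          }
        }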

Q. In which two situations would you use the Web Services Client stage?
  • You need the Web service to act as either a data source or a data target during an operation.
  • You do not need both input and output links in a single web service operation.

Q. What are Sequencers?
  • A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers.

Q. How to unlock a Datastage job?
  • Clean-up Resources in Director
  • Clear Status File in Director
  • DS.Tools in Administrator
  • DS.Tools in UNIX

Q. What are Active and Passive Stages?
  • Passive stages are used for data extraction and loading, while active stages are used to implement and process the business rules.

Q. How do you generate sequence numbers in DataStage?
  • Using a routine such as:
  • KeyMgtGetNextVal
  • KeyMgtGetNextValConn
  • It can also be done with an Oracle sequence.

Q. What are the Job parameters?
  • These parameters are used to provide administrative access and to change run-time values of the job (Edit > Job Parameters). In the Parameters tab we can define the name, prompt, type and value.

Q. What is the difference between Hashfile and Sequential File?
  • A hash file stores data based on a hash algorithm and a key value, whereas a sequential file is just a file with no key column. A hash file can be used as a lookup reference; a sequential file cannot.

Q. What is the difference between Account and Directory options?
  • The Account option creates the hash file within the current project (account) itself, while for the Directory option we have to specify the path of the directory.

Q. What is the difference between Server Job and Parallel Jobs?
  • Server jobs work in a sequential way, while parallel jobs work in a parallel fashion (Parallel Extender works on the principle of pipeline and partition parallelism) for input/output processing.

Q. What are the types of jobs we have in Datastage?
  • Server Jobs
  • Parallel Jobs
  • Mainframe Jobs

Q. What is Staging Variable?
  • These are temporary variables created in the Transformer for calculations.

Q. What are Routines?
  • Routines are functions developed in BASIC code for required tasks that DataStage does not fully support out of the box (complex logic).

Q. Unix Command to stop Datastage Engine?
  • /bin/uv -admin -stop

Q. Unix Command to start Datastage Engine? 

  • /bin/uv -admin -start

Q. UNIX command to check DataStage jobs running on the server?

  • ps -ef | grep phantom

Q. UNIX command to check DataStage sessions running on the back end?

  • netstat -na | grep dsr
  • netstat -a | grep dsr
  • netstat -a | grep dsrpc

Q. Where are DataStage temporary dataset files stored while running a DataStage parallel job?
  • ResourceScratchDisk

Q. You need to invoke a job from the command line that is a multi-instance enabled. What is the correct syntax to start a multi-instance job?

  • dsjob -run -mode NORMAL
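For a multi-instance job, the invocation id is typically appended to the job name with a dot. A sketch (the project, job, and invocation id names are illustrative):

        dsjob -run -mode NORMAL DevProject BuildWarehouse.INV1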

Q. A client must support multiple languages in selected text columns when reading from DB2 database. Which two actions will allow selected columns to support such data?

  • Choose Unicode setting in the extended column attribute.
  • Choose NVar/NVarchar as data types.

Q. What are the types of Containers?
        
       There are two types of containers, namely:
  • Local Container
  • Shared Container

Q. Define data aggregation?
  • An operational data source usually contains records of individual transactions such as product sales. If the user of a data warehouse only needs a summed total, you can reduce records to a more manageable number by aggregating the data.

Q. What are stage variables?
  • An intermediate processing variable that retains its value during processing and does not pass the value to a target column.

Q. Which two system variables/techniques must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions?
  • @PARTITIONNUM
  • @NUMPARTITIONS
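A common Transformer derivation combining these to generate a unique, non-overlapping key across partitions is sketched below (@INROWNUM is the row number within the current partition; this formula is a widely used convention rather than the only option):

        (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1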

Q. What is the default Array Size in OCI stage?

  • 32767

Q. What is the default cache size of Datastage?

  • 256MB

Q. Which is a part of Managing active sessions in Datastage?

  • Viewing all active sessions
  • Setting session limits
  • Opening user details
  • Disconnecting a session

Q. What can you do from the Administrator client?
  • Set up user permissions for projects
  • Purge job log file
  • Set Environment variable default value
  • Add, delete, and move InfoSphere® DataStage® projects

Q. What are the different ways to run a job?
  • Using DataStage Director and DataStage Administration Client.

Q. How can we improve the performance in DataStage?

On the server canvas we can improve performance in two ways:
  • Firstly, we can increase memory by enabling inter-process row buffering in the job properties.
  • Secondly, by inserting an IPC stage we break one process into two processes. We can use this stage to connect two passive stages or two active stages.

Q. What is the Orabulk stage?
  • This stage is used to bulk load an Oracle target database.

Q. How do you schedule or monitor a job?
  • Using the DataStage Director we can schedule or monitor the job.

Q. What is APT_CONFIG in datastage?
  • DataStage understands the architecture of the system through this file (APT_CONFIG_FILE). For example, this file contains information such as node names and disk storage locations.
  • APT_CONFIG is just an environment variable used to identify the *.apt file.
  • Don't confuse the variable with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.

Q. What are orabulk and bcp stages?
  • These are plug-in stages. Orabulk is used when we have bulk data to load into Oracle; for databases other than Oracle we go for the BCP stage.

Q. Why is a fact table in normal form?
  • A fact table consists of measurements of the business process and the foreign keys of the dimension tables, as per the business rules.

Q. What is the difference between the local container and shared container?
  • A local container is local to the particular job in which we developed it.
  • A shared container can be used in other jobs as well.

Q. What can we do with DataStage Director?
  • Validating
  • Scheduling
  • Executing
  • Monitoring Jobs (server Jobs).

Q. Which algorithm is used for your hash file?
  • It uses the GENERAL or SEQ.NUM algorithm.

Q. What types of jobs have you developed?
  • Server Job
  • Parallel Job
  • Sequencer Job
  • Container Job.

Q. How do you register plug-ins?
  • Using DataStage Manager.

Q. What are iConv and oConv?

These are conversion functions, most commonly used to convert dates between internal and external formats:
  • iConv - External to Internal
  • oConv- Internal to External
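A small DataStage BASIC sketch (the conversion codes shown are illustrative and should match your own date formats):

        InternalDate = Iconv("31/12/2014", "D/DMY[2,2,4]")   ;* external date to internal day number
        DisplayDate  = Oconv(InternalDate, "D-YMD[4,2,2]")   ;* internal value back out as "2014-12-31"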

Q. How do you convert the columns to rows in DataStage?
  • Using Pivot Stage.

Q. What are the types of input links in datastage?

DataStage supports two types of input link:
  • Stream : A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.
  • Reference : A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.

Q. Define dynamic arrays?
  • Dynamic arrays map the structure of DataStage file records to character string data. Any character string can be a dynamic array. A dynamic array is a character string containing elements that are substrings separated by delimiters.
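For example, in DataStage BASIC (a sketch; @FM is the system field-mark delimiter):

        Rec = "Smith" : @FM : "John" : @FM : "Engineer"
        FirstName = Rec<2>      ;* angle-bracket extraction returns "John"
        Rec<4> = "London"       ;* adds a fourth field to the dynamic array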

Q. Where does DataStage store its repository?
  • DataStage stores its repository in the IBM UniVerse database.

Q. What are the command line functions that import and export the DS jobs?
  • dsimport.exe : imports the DataStage components.
  • dsexport.exe : exports the DataStage components.

Q. What is transformer stage?
  • Transformer stages do not extract data or write data to a target database. They are used to handle extracted data, perform any conversions required, and pass data to another Transformer stage or a stage that writes data to a target data table.

Q. Define the UniVerse stage?
  • A stage that extracts data from or loads data into a UniVerse database using SQL. It is used to represent a data source, an aggregation step, or a target data table.

Q. How we can reuse the components?
  • Using the Shared and Local Containers.

Q. What are the Types of Jobs?
  • Server Jobs : designed, developed and compiled in Designer; validated, scheduled, executed and monitored in Director. The server can run on Windows NT or UNIX.
  • Parallel Jobs : designed, developed and compiled in Designer; validated, scheduled, executed and monitored in Director. The server should be on a UNIX system only.
  • Mainframe Jobs : designed and developed in Designer (Windows); compiled, validated, scheduled, executed and monitored on mainframe systems only.

Q. In which two situations would you not use the Web Services Client stage?
  • You want to deploy a service.
  • You need to create a WSDL.

Q. Which two actions can improve sort performance in a DataStage job?

  • Specify only the key columns which are necessary.
  • Minimize the number of sorts used within a job flow.
  • Adjusting the "Restrict Memory Usage" option in the Sort stage.

Q. You are experiencing performance issues for a given job. You are assigned the task of understanding what is happening at run time for the given job. What are the first two steps you should take to understand the job performance issues?

  • Run job with $APT_TRACE_RUN set to true.
  • Review the objectives of the job.

Q. Your customer asks you to identify which stages in a job are consuming the largest amount of CPU time. Which product feature would help identify these stages?

  • $APT_PM_PLAYER_TIMING

Q. Which three statements describe a DataStage installation in a clustered environment?

  • The conductor node will create the job score and consolidate messages to the DataStage log.
  • For clustered implementations, appropriate node pools are useful to reduce data shipping.
  • Compiled job and all shared components must be available across all servers.

Q. Which three defaults are set in DataStage Administrator?
  • Project level defaults for environment variables.
  • Project level default for compile options
  • Project level default for Runtime Column Propagation

Q. Which two environment variables should be set to "True" to allow you to see operator process statistics at run-time in the job log?
  • $APT_PM_PLAYER_MEMORY
  • $APT_PM_PLAYER_TIMING

Q. Which three statements are true about National Language Support (NLS)?
  • NLS must be selected during installation to use it.
  • Within an NLS-enabled DataStage environment, maps are used to convert external data into UTF-16.
  • Reading or writing 7-bit ASCII data from a database does not require NLS support.

Q. Upon which two conditions does the number of data files created by a File Set depend?
  • The number of processing nodes in the default node pool
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool

Q. Which command line switch can be used to return the most recent start time for a given job?
  • dsjob -jobinfo

Q. You are working on a project that contains a large number of jobs contained in many folders. You would like to review the jobs created by the former developer of the project. How can you find these jobs?
  • Use the Advanced Find feature contained in the Designer interface.

Q. Which two statements are true for named node pools?
  • Using appropriately named node pools can allow separation of buffering from sorting disks.
  • Using appropriately named node pool constraints will limit stages to execute only on the nodes defined in those node pools.

Q. Which three methods can be used to import metadata from a Web Services Description Language (WSDL) document?
  • Web Service Function Definitions
  • XML Table Definitions
  • Web Services WSDL Definitions

Q. What are two tasks that can create DataStage projects?
  • Install the DataStage engine.
  • Add new projects from DataStage Administrator.

Q. Upon which two conditions does the number of data files created by a Data Set depend?
  • The number of processing nodes in the default node pool.
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool.

Q. Which requirement must be met to read from a database in parallel using the ODBC Enterprise stage?
  • Specify the partition column property.

Q. Which statements are true for APT_DISABLE_COMBINATION?
  • Disabling generates more processes requiring more system resources and memory.
  • Globally disables operator combining.

Q. What technique would you use to abort a job in a Transformer stage?
  • Create a dummy output link with a constraint that tests for the condition to abort on, and set that link's "Abort After Rows" property to a low value such as 1.

Q. What is the dsrpcd daemon?
  • The dsrpcd daemon is the means by which the processes that represent DataStage jobs are started. The environment that DataStage processes inherit when they are started is the same environment as for dsrpcd. ODBC drivers and some plug-ins require that certain directories are included in the shared library environment variable setting for dsrpcd.

Q. A job design consists of an input sequential file, a Modify stage, followed by a Filter stage and an output Sequential File stage. The job is running on an SMP machine with a configuration file defined with three nodes. No environment variables were set for the job. How many osh processes will this job create?
  • 9

Q. Using a DB2 for z/OS source database, a 200 million row source table with 30 million distinct values must be aggregated to calculate the average value of two column attributes. What would provide optimal performance while satisfying the business requirements?
  • Select all source rows using a DB2 API stage and then Aggregate using a Sort Aggregator.

Q. In your DB2 database you have column names that use characters # and $. Which two steps should be completed to allow DataStage to convert these characters into internal format and back as necessary?
  • Set environment variable $DS_ENABLE_RESERVED_CHAR_CONVERT to true.
  • Avoid using the strings __035__ and __036__ in your IBM DB2 column names.

Q. When invoking a job from a third-party scheduler, it is often desirable to invoke a job and wait for its completion in order to return the job's completion status. Which three commands would invoke a job named "BuildWarehouse" in project DevProject and wait for the job's completion?
  • dsjob -run -jobstatus DevProject BuildWarehouse
  • dsjob -run -userstatus DevProject BuildWarehouse
  • dsjob -run -wait DevProject BuildWarehouse
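When one of these is wrapped in a scheduler script, the dsjob exit code can be checked to decide success or failure. A minimal shell sketch, assuming the usual status codes (1 = run OK, 2 = run with warnings, anything else treated as failure):

        dsjob -run -jobstatus DevProject BuildWarehouse
        rc=$?
        if [ $rc -eq 1 ] || [ $rc -eq 2 ]; then
            echo "BuildWarehouse finished with status $rc"
        else
            echo "BuildWarehouse failed with status $rc" >&2
            exit 1
        fi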

Thursday 4 December 2014

Parameters Using Parameter/Value Set/Value File in Datastage


When a file-based parameter set is used, the parameter set picks up the contents of the file at the time the job is started. This provides excellent flexibility, as one job can manipulate a parameter set file and leave new parameters for a subsequent job. The key is that the parameter values are not compiled into the job or the parameter set.


The following steps detail the creation and use of file based parameter sets. 
File > New > Other > Parameter Set

In the General tab, name the parameter set. The name should contain only alphanumeric and underscore characters, with no spaces.
For example: my_file_parm_set



   1. In the Parameters tab, create the parameters and defaults


    2. In the Values tab, put the name of the file in the "Value File Name" column. Specify a file name only, with no path component.

Use only alphanumeric characters and no spaces. In the same row, list values for each parameter defined in the Parameters tab.


         3. Click OK and save the parameter set.


      4. At this point the parameter set file should have been created using the following path pattern:

Projects<project_name>/ParameterSets/<parm_nat_name>/<parm_file_name>

In our example the file name would be: InformationServer\Server\Projects\foo2\ParameterSets\my_file_parm_set\parm_file.txt
The contents of the file looks like this:
foo=Parm file value for foo
bar=Parm file value for bar 

In the job, navigate to Edit > Job Properties and select the Parameters tab. 

Click Add Parameter Set and select the newly created parameter set:


              
******Create the Sequence Job*****
      5. Create a new sequence and add a Job Activity stage to it. Double-click the stage to edit its properties, then click the button to select a job name.

    6. Once the job is loaded, its parameters become visible in the Parameters section of the Job Activity screen. The new parameter set should be in the list. Click on the drop down selector in the Value Expression column and select the parameter file:

Click OK and save the sequence.

      7. Now, every time the sequence is run, the parameters are read from the file at run time and passed to the job. Any preceding job in the sequence can change the file to dynamically modify the parameters, as sketched below.
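For instance, an Execute Command activity placed before the Job Activity could rewrite the value file. A UNIX sketch using the example file from above (the install root shown is a placeholder):

        echo "foo=Value computed by the previous job" >  /opt/IBM/InformationServer/Server/Projects/foo2/ParameterSets/my_file_parm_set/parm_file.txt
        echo "bar=Another computed value"             >> /opt/IBM/InformationServer/Server/Projects/foo2/ParameterSets/my_file_parm_set/parm_file.txt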

Note: When the number of parameters in a parameter set is changed, all jobs using the parameter set must be recompiled.

*******************************************************************************