Thursday, 11 December 2014

Change Data Capture in Informatica

Change Data Capture can be used for small data integration projects that include only a few workflows. We will be using Informatica Mapping Variables to implement the Change Data Capture logic.

Informatica Mapping Variable:

A mapping variable is created in PowerCenter Designer. It can be used in any expression in a mapping, in a Source Qualifier filter, user-defined join, or extract override, and in the Expression Editor of reusable transformations.

Mapping variable can take the starting value from:
  • Initial value
  • Default Value
  • Parameter file
  • Pre-session variable assignment
  • Value saved in the repository
In Informatica, the Integration Service looks for the start value in the order listed above. The value of the mapping variable can be changed within the session using an expression, and the final value of the variable is saved to the repository. The saved value is retrieved from the repository in the next session run and used as that run's start value.
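
For example, a workflow parameter file can supply the start value. The folder, workflow, and session names below are placeholders; the variable name matches the one used later in this post:

        [MyFolder.WF:wf_CUST_LOAD.ST:s_m_CUST_CDC]
        $$M_END_DATE=12/01/2014 00:00:00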

How to Set Mapping Variable Value?

You can set or change the mapping variable value within the mapping or session using a set function. The set function to use depends on the Aggregation Type of the variable, which is set when the variable is declared in the mapping (a small illustration follows the list below).
  • SetMaxVariable: Sets the variable to the maximum value of a group of values. To use this variable, the aggregation type of the mapping variable must be set to Max.
  • SetMinVariable: Sets the variable to the minimum value of a group of values. To use this variable, the aggregation type of the mapping variable must be set to Min.
  • SetCountVariable: Increments the variable value by one. In other words, it adds one to the variable value when a row is marked for insertion, and subtracts one when the row is marked for deletion. To use this variable, the aggregation type of the mapping variable must be set to Count.
  • SetVariable: Sets the variable to the configured value. At the end of a session, it compares the final current value of the variable to the start value of the variable. Based on the aggregate type of the variable, it saves a final value to the repository.
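
As a simple illustration (this post itself uses SETMAXVARIABLE, shown further below), a SetVariable call in an Expression transformation could capture the session start time. SESSSTARTTIME is a built-in Informatica variable, and the mapping variable name is the one created later in this post:

        SETVARIABLE($$M_END_DATE, SESSSTARTTIME)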

Change Data Capture Implementation:

Now, let’s start building our mapping with Change Data Capture.
Here we are going to implement Change Data Capture for the CUSTOMER data load. We need to load data for any new or changed customers to a flat file. Since the LATEST_DT column value changes for any new or updated customer record, we can identify new or changed customer records using the LATEST_DT column.

First, create the mapping variable while building the mapping, as shown in the image below.

        $$M_END_DATE as Date/Time



Now bring the source and Source Qualifier into the Mapping Designer workspace. Open the Source Qualifier and add the filter condition to get the latest data from the source, as shown below.

STG_CUST.LATEST_DT > CONVERT(DATETIME, '$$M_END_DATE')



Note: The latest value for the variable $$M_END_DATE is retrieved from the repository every time the session is run.

Now map the column LATEST_DT to an Expression transformation and create a variable port with the expression below.

        SETMAXVARIABLE($$M_END_DATE, LATEST_DT)



Note: This expression ensures that the latest value from the LATEST_DT column is stored in the repository after the session run completes successfully.

Now you can map all the remaining columns to the downstream transformations and complete the other transformations required in the mapping.
Informatica Mapping

That is everything you need to configure for Change Data Capture. Now create your workflow and run it.



If you look at the session log file, you can see that the mapping variable value is retrieved from the repository and used in the source SQL.

You can view the mapping variable value stored in the repository from Workflow Manager: select the session in the workspace, right-click, and choose 'View Persistent Value'.



Monday, 8 December 2014

Datastage Interview Questions and Answers

Q. What is DataStage?
  • Design jobs for Extraction, Transformation and Loading (ETL).
  • Ideal tool for data integration projects such as data warehouses, data marts and system migrations.
  • Import, export, create and manage metadata for use within jobs.
  • Schedule, run and monitor jobs, all within DataStage.
  • Administer your DataStage development and execution environments.

Q. What are the components of Ascential DataStage?
  • Client Components - Administrator, Director, Manager, and Designer.
  • Server Components - Repository, Server and Plug-ins.

Q. Which Command is required to check the Datastage Job Status?
  • dsjob -status
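
For example (the project and job names here are placeholders):

        dsjob -status dw_project daily_customer_load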

Q. What is version Control?
  • Version Control stores different versions of DataStage jobs, runs different versions of the same job, reverts to a previous version of a job, and lets you view version histories.

Q. What is the difference between DataStage and Informatica?
  • DataStage supports parallel processing, which Informatica doesn't.
  • Links are objects in DataStage; in Informatica it is port-to-port connectivity.
  • In Informatica it is easy to implement Slowly Changing Dimensions, which is a little more complex in DataStage.
  • DataStage doesn't support complete error handling.

Q. What is a merge?
  • Merge is a stage that is available in both parallel and server jobs.
  • The merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel).
  • Merge requires the master table/dataset and the update table/dataset to be sorted.
  • Merge is performed on a key field, and the key field is mandatory in the master and update dataset/table.

Q. What are the parts/components of a configuration file? (A minimal example follows the list.)
  • Node
  • ServerName
  • Pools
  • FastName
  • ResourceDisk
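
A minimal single-node configuration file sketch is shown below; the node name, fastname, and resource paths are illustrative only:

        {
          node "node1"
          {
            fastname "etl_server"
            pools ""
            resource disk "/data/datasets/node1" {pools ""}
            resource scratchdisk "/data/scratch/node1" {pools ""}
          }
        }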

Q. In which two situations would you use the Web Services Client stage?
  • You need the Web service to act as either a data source or a data target during an operation.
  • You do not need both input and output links in a single web service operation.

Q. What are Sequencers?
  • A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers.

Q. How to unlock a Datastage job?
  • Clean-up Resources in Director
  • Clear Status File in Director
  • DS.Tools in Administrator
  • DS.Tools in UNIX

Q. What is Active and Passive Stage?
  • Passive stages are used for data extraction and loading, while active stages are used to implement and process the business rules.

Q. How do you generate Sequence number in Datastage?
  • Using the Routine
  • KeyMgtGetNextVal
  • KeyMgtGetNextValConn
  • They can also be generated using an Oracle sequence (see the sketch below).
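
A minimal Oracle sequence sketch (the sequence name is illustrative); the generated value can then be pulled into the job through a lookup or an SQL query:

        CREATE SEQUENCE cust_key_seq START WITH 1 INCREMENT BY 1;
        SELECT cust_key_seq.NEXTVAL FROM dual;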

Q. What are the Job parameters?
  • These parameters are used to provide administrative access and change run-time values of the job. Edit > Job Parameters; in the Parameters tab we can define the name, prompt, type, and value.

Q. What is the difference between Hashfile and Sequential File?
  • A hash file stores data based on a hash algorithm and a key value. A sequential file is just a file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.

Q. What is the difference between Account and Directory options?
  • The Account option creates the hash file in the current project itself, while for the Directory option we have to specify the path of the directory.

Q. What is the difference between Server Job and Parallel Jobs?
  • Server jobs work in a sequential way, while parallel jobs work in a parallel fashion (the Parallel Extender works on the principle of pipelining and partitioning) for input/output processing.

Q. What are the types of jobs we have in Datastage?
  • Server Jobs
  • Parallel Jobs
  • Mainframe Jobs

Q. What is Staging Variable?
  • These are temporary variables created in the Transformer for calculations.

Q. What are Routines?
  • Routines are functions we develop in BASIC code for required tasks that DataStage does not fully support out of the box (complex logic).

Q. Unix Command to stop Datastage Engine?
  • /bin/uv -admin -stop

Q. Unix Command to start Datastage Engine? 

  • /bin/uv -admin -start

Q. Unix command to check datastage jobs running at server?

  • ps -ef | grep phantom

Q. Unix Command to check Datastage sessions running at backend?

  • netstat -na | grep dsr
  • netstat -a | grep dsr
  • netstat -a | grep dsrpc

Q. Where are DataStage temporary dataset files stored while running a DataStage parallel job?
  • ResourceScratchDisk

Q. You need to invoke a job from the command line that is a multi-instance enabled. What is the correct syntax to start a multi-instance job?

  • dsjob -run -mode NORMAL <project> <job>.<invocation_id>

Q. A client must support multiple languages in selected text columns when reading from DB2 database. Which two actions will allow selected columns to support such data?

  • Choose Unicode setting in the extended column attribute.
  • Choose NVar/NVarchar as data types.

Q. What are the types of Containers?
        
       There are two types of containers:
  • Local Container
  • Shared Container

Q. Define data aggregation?
  • An operational data source usually contains records of individual transactions such as product sales. If the user of a data warehouse only needs a summed total, you can reduce records to a more manageable number by aggregating the data.

Q. What are stage variables?
  • An intermediate processing variable that retains its value from row to row and does not pass the value to a target column.

Q. Which two system variables/techniques must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions?
  • @PARTITIONNUM
  • @NUMPARTITIONS
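
A commonly used Transformer derivation combines these two system variables with @INROWNUM (also a system variable) to produce non-overlapping values across partitions:

        @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1

With N partitions, partition 0 generates 1, N+1, 2N+1, ..., partition 1 generates 2, N+2, ..., so no two partitions ever emit the same number.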

Q. What is the default Array Size in OCI stage?

  • 32767

Q. What is the default cache size of Datastage?

  • 256MB

Q. Which is a part of Managing active sessions in Datastage?

  • Viewing all active sessions
  • Setting session limits
  • Opening user details
  • Disconnecting a session

Q. What can you do from the Administrator client?
  • Set up user permissions for projects
  • Purge job log file
  • Set Environment variable default value
  • Add, delete, and move InfoSphere® DataStage® projects

Q. What are the different ways to run a job?
  • Using DataStage Director and DataStage Administration Client.

Q. How can we improve the performance in DataStage?

In the server canvas we can improve performance in two ways:
  • First, we can increase memory by enabling inter-process row buffering in the job properties.
  • Second, by inserting an IPC stage we break a process into two processes. We can use this stage to connect two passive stages or two active stages.

Q. What is orabulk Stage?
  • This stage is used to bulk load the Oracle target database.

Q. How do you schedule or monitoring the job?
  • Using the DataStage Director we can schedule or monitor the job.

Q. What is APT_CONFIG in datastage?
  • DataStage understands the architecture of the system through this file (APT_CONFIG_FILE). For example, this file contains information such as node names and disk storage information.
  • APT_CONFIG is just an environment variable used to identify the *.apt file.
  • Don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.

Q. What are orabulk and bcp stages?
  • These are called plug-in stages. ORABULK is used when we have bulk data to load into Oracle; for databases other than Oracle we go for the BCP stage.

Q. Why is the fact table in normal form?
  • A fact table consists of measurements of the business process and foreign keys to the dimension tables, as per business rules.

Q. What is the difference between the local container and shared container?
  • A local container is local to the particular job in which we developed the container.
  • A shared container can be used in other jobs as well.

Q. What can we do with DataStage Director?
  • Validating
  • Scheduling
  • Executing
  • Monitoring Jobs (server Jobs).

Q. Which algorithm you used for your hashfile?
  • It uses the GENERAL or SEQ.NUM algorithm.

Q. What all the types of Jobs you developed?
  • Server Job
  • Parallel Job
  • Sequencer Job
  • Container Job.

Q. How do you register plug-ins?
  • Using DataStage Manager.

Q. What are iConv and oConv?

These are conversion functions, commonly used to convert dates between internal and external formats:
  • iConv - External to Internal
  • oConv- Internal to External

Q. How do you convert the columns to rows in DataStage?
  • Using Pivot Stage.

Q. What are the types of input links in datastage?

DataStage supports two types of input link:
  • Stream : A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.
  • Reference : A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.

Q. Define dynamic arrays?
  • Dynamic arrays map the structure of DataStage file records to character string data. Any character string can be a dynamic array. A dynamic array is a character string containing elements that are substrings separated by delimiters.
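
A short DataStage BASIC sketch of the idea (the variable and field names are illustrative); @FM and @VM are the standard field and value mark delimiters:

        * Build a dynamic array: two fields, the second holding two values
        Cust = "C100" : @FM : "John" : @VM : "Smith"

        * Extract pieces by position: field 1, then field 2 / value 2
        Id       = Cust<1>        ;* "C100"
        LastName = Cust<2,2>      ;* "Smith"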

Q. Where does DataStage store its repository?
  • DataStage stores its repository in the IBM UniVerse database.

Q. What are the command line functions that import and export the DS jobs?
  • dsimport.exe : imports the DataStage components.
  • dsexport.exe : exports the DataStage components.

Q. What is transformer stage?
  • Transformer stages do not extract data or write data to a target database. They are used to handle extracted data, perform any conversions required, and pass data to another Transformer stage or a stage that writes data to a target data table.

Q. Define universe stage?
  • A stage that extracts data from or loads data into a Universe database using SQL. Used to represent a data source, an aggregation step, or a target data table.

Q. How we can reuse the components?
  • Using the Shared and Local Containers.

Q. What are the Types of Jobs?
  • Server Jobs : Designed, developed and compiled in Designer; validated, scheduled, executed and monitored in Director; run on a Windows NT/UNIX server.
  • Parallel Jobs : Designed, developed and compiled in Designer; validated, scheduled, executed and monitored in Director; the server should be on a UNIX system only.
  • Mainframe Jobs : Designed and developed in Designer (Windows); compiled, validated, scheduled, executed and monitored on mainframe systems only.

Q. In which two situations would you not use the Web Services Client stage?
  • You want to deploy a service.
  • You need to create a WSDL.

Q. Which two actions can improve sort performance in a DataStage job?

  • Specify only the key columns which are necessary.
  • Minimize the number of sorts used within a job flow.
  • Adjust the "Restrict Memory Usage" option in the Sort stage.

Q. You are experiencing performance issues for a given job. You are assigned the task of understanding what is happening at run time for the given job. What are the first two steps you should take to understand the job performance issues?

  • Run job with $APT_TRACE_RUN set to true.
  • Review the objectives of the job.

Q. Your customer asks you to identify which stages in a job are consuming the largest amount of CPU time. Which product feature would help identify these stages?

  • $APT_PM_PLAYER_TIMING

Q. Which three statements describe a DataStage installation in a clustered environment?

  • The conductor node will create the job score and consolidate messages to the DataStage log.
  • For clustered implementations, appropriate node pools are useful to reduce data shipping.
  • Compiled job and all shared components must be available across all servers.

Q. Which three defaults are set in DataStage Administrator?
  • Project level defaults for environment variables.
  • Project level default for compile options
  • Project level default for Runtime Column Propagation

Q. Which two environment variables should be set to "True" to allow you to see operator process statistics at run-time in the job log?
  • $APT_PM_PLAYER_MEMORY
  • $APT_PM_PLAYER_TIMING

Q. Which three statements are true about National Language Support (NLS)?
  • NLS must be selected during installation to use it.
  • Within an NLS-enabled DataStage environment, maps are used to convert external data into UTF-16.
  • Reading or writing 7-bit ASCII data from a database does not require NLS support.

Q. Upon which two conditions does the number of data files created by a File Set depend?
  • The number of processing nodes in the default node pool
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool

Q. Which command line switch can be used to return the most recent start time for a given job?
  • dsjob -jobinfo

Q. You are working on a project that contains a large number of jobs contained in many folders. You would like to review the jobs created by the former developer of the project. How can you find these jobs?
  • Use the Advanced Find feature contained in the Designer interface.

Q. Which two statements are true for named node pools?
  • Using appropriately named node pools can allow separation of buffering from sorting disks.
  • Using appropriately named node pool constraints will limit stages to execute only on the nodes defined in those node pools.

Q. Which three methods can be used to import metadata from a Web Services Description Language (WSDL document)?
  • Web Service Function Definitions
  • XML Table Definitions
  • Web Services WSDL Definitions

Q. What are two tasks that can create DataStage projects?
  • Install the DataStage engine.
  • Add new projects from DataStage Administrator.

Q. Upon which two conditions does the number of data files created by a Data Set depend?
  • The numbers of processing nodes in the default node pool.
  • The number of disks in the export or default disk pool connected to each processing node in the default node pool.

Q. Which requirement must be met to read from a database in parallel using the ODBC Enterprise stage?
  • Specify the partition column property.

Q. Which statements are true for APT_DISABLE_COMBINATION?
  • Disabling generates more processes requiring more system resources and memory.
  • Globally disables operator combining.

Q. Techniques you will use to abort a job in Transformer stage?
  • Create a dummy output link with a constraint that tests for the condition to abort on, and set the link's "Abort After Rows" property to a nonzero value such as 1.
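
For example, the dummy link's constraint could be something like the expression below (the link and column names are illustrative), with "Abort After Rows" set to 1 so the job aborts as soon as one row satisfies the condition:

        IsNull(lnk_in.CUSTOMER_ID) Or lnk_in.ORDER_AMOUNT < 0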

Q. What is the dsrpcd daemon?
  • The dsrpcd daemon is the means by which processes that represent DataStage jobs are started. The environment that DataStage processes inherit when they are started is the same environment as for dsrpcd. ODBC drivers and some plug-ins require that certain directories are included in the shared library environment variable setting for dsrpcd.

Q. A job design consists of an input sequential file, a Modify stage, followed by a Filter stage and an output Sequential File stage. The job is running on an SMP machine with a configuration file defined with three nodes. No environment variables were set for the job. How many osh processes will this job create?
  • 9

Q. Using a DB2 for z/OS source database, a 200 million row source table with 30 million distinct values must be aggregated to calculate the average value of two column attributes. What would provide optimal performance while satisfying the business requirements?
  • Select all source rows using a DB2 API stage and then Aggregate using a Sort Aggregator.

Q. In your DB2 database you have column names that use characters # and $. Which two steps should be completed to allow DataStage to convert these characters into internal format and back as necessary?
  • Set environment variable $DS_ENABLE_RESERVED_CHAR_CONVERT to true.
  • Avoid using the strings __035__ and __036__ in your IBM DB2 column names.

Q. When invoking a job from a third-party scheduler, it is often desirable to invoke a job and wait for its completion in order to return the job's completion status. Which three commands would invoke a job named "BuildWarehouse" in project DevProject and wait for the job's completion?
  • dsjob -run -jobstatus DevProject BuildWarehouse
  • dsjob -run -userstatus DevProject BuildWarehouse
  • dsjob -run -wait DevProject BuildWarehouse
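
A small shell sketch of the -jobstatus form, reusing the project and job names from the question; the exit-code handling is based on the usual DSJS status values (1 = ran OK, 2 = ran with warnings) and should be verified against your environment:

        #!/bin/sh
        # Run the job and wait; with -jobstatus, dsjob's exit code reflects the job's final status
        dsjob -run -jobstatus DevProject BuildWarehouse
        rc=$?

        if [ "$rc" -eq 1 ] || [ "$rc" -eq 2 ]; then
            echo "BuildWarehouse finished (status $rc)"
        else
            echo "BuildWarehouse failed (status $rc)" >&2
            exit 1
        fi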

Thursday, 4 December 2014

Parameters Using Parameter/Value Set/Value File in Datastage


When a file-based parameter set is used, the parameter set picks up the contents of the file at the time the job is started. This provides excellent flexibility, as an earlier job can manipulate a parameter set file and leave new parameters for a subsequent job. The key is that the parameter values are not compiled into the job or the parameter set.


The following steps detail the creation and use of file based parameter sets. 
File > New > Other > Parameter Set

In the General tab, name the parameter set. Use alphanumeric and underscore characters only, with no spaces.
For example: my_file_parm_set



   1. In the Parameters tab, create the parameters and defaults


    2. In the Values tab, put the name of the file in the "Value File Name" column. Specify a file name only, with no path component.

Use alphanumeric characters etc. and no spaces. In the same row, list values for each parameter defined in the Parameters tab. The table should look like this:


         3. Click OK and save the parameter set.


      4. At this point the parameter set file should have been created at the following location:

Projects/<project_name>/ParameterSets/<parm_set_name>/<parm_file_name>

In our example the file name would be: InformationServer\Server\Projects\foo2\ParameterSets\my_file_parm_set\parm_file.txt
The contents of the file looks like this:
foo=Parm file value for foo
bar=Parm file value for bar 

In the job, navigate to Edit > Job Properties and select the Parameters tab. 

Click Add Parameter Set and select the newly created parameter set:


              
******Create the Sequence Job*****
      5. Create a new sequence and add a Job Activity stage to it. Double-click on the stage to edit its properties, then click on the button to select a job name.

    6. Once the job is loaded, its parameters become visible in the Parameters section of the Job Activity screen. The new parameter set should be in the list. Click on the drop down selector in the Value Expression column and select the parameter file:

Click OK and save the sequence.

      7. Now every time the sequence is run, parameters are read from the file at run time and passed to the job. Any preceding jobs in the sequence can change the file to dynamically modify the parameters (see the sketch below).
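
For instance, a preceding Execute Command activity (or an external script) could rewrite the value file before the job activity runs; the UNIX-style path below mirrors the Windows path shown earlier and is illustrative only:

        # Hypothetical engine-side path to the parameter set value file
        PARMFILE=/opt/IBM/InformationServer/Server/Projects/foo2/ParameterSets/my_file_parm_set/parm_file.txt

        # Overwrite the values that the next run will pick up
        echo "foo=Value generated on $(date +%Y-%m-%d)" >  "$PARMFILE"
        echo "bar=daily_load"                           >> "$PARMFILE"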

Note: When the number of parameters in a parameter set is changed, all jobs using the parameter set must be recompiled.

*******************************************************************************