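QualityStage:
Investigate stage:
Investigate: 3 methods:
1. Character discrete -> C, T, X masks
2. Character concatenate -> C, T, X masks
3. Word investigate -> Token and Pattern reports (uses a rule set)
Masks: C = keep the character as is, T = show the character type (a = alphabetic, n = numeric), X = skip the character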
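To make the C, T and X masks concrete, here is a minimal Python sketch of applying a mask to one value. It is not how QualityStage is implemented; `apply_mask` is a hypothetical helper, and passing special characters through unchanged under the T mask is an assumption of this sketch.

```python
def apply_mask(value: str, mask: str) -> str:
    """Apply a per-character mask (C, T or X) to a value. Hypothetical helper."""
    out = []
    for ch, m in zip(value, mask):
        if m == "C":
            out.append(ch)        # C: keep the character as is
        elif m == "T":
            if ch.isalpha():
                out.append("a")   # T: alphabetic becomes 'a'
            elif ch.isdigit():
                out.append("n")   # T: numeric becomes 'n'
            else:
                out.append(ch)    # assumption: other characters pass through
        elif m == "X":
            continue              # X: skip this position
        else:
            raise ValueError(f"unknown mask character: {m!r}")
    return "".join(out)


print(apply_mask("AB-1234", "TTTTTTT"))  # aa-nnnn
print(apply_mask("AB-1234", "CCXTTTT"))  # ABnnnn
```

The second call mixes masks: the first two characters are kept, the hyphen is skipped, and the digits are reduced to their type.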
Investigate default column names for the Pattern Report:
1. QsInvColumnName
2. QsInvPattern
3. QsInvSample
4. QsInvCount
5. QsInvPercentage
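As a rough illustration of what one row of a pattern report holds, the sketch below builds a frequency table of type patterns for a single column and labels each row with the column names listed above. The functions `type_pattern` and `pattern_report` and the sample zip values are made up for this example; the real stage produces the report itself.

```python
from collections import Counter


def type_pattern(value: str) -> str:
    # Same idea as a T mask: letters become 'a', digits become 'n'.
    return "".join("a" if c.isalpha() else "n" if c.isdigit() else c for c in value)


def pattern_report(column_name, values):
    counts = Counter()
    samples = {}
    for v in values:
        p = type_pattern(v)
        counts[p] += 1
        samples.setdefault(p, v)  # keep the first value seen as the sample
    total = sum(counts.values())
    return [
        {
            "QsInvColumnName": column_name,
            "QsInvPattern": p,
            "QsInvSample": samples[p],
            "QsInvCount": n,
            "QsInvPercentage": round(100.0 * n / total, 2),
        }
        for p, n in counts.most_common()
    ]


rows = pattern_report("Zip", ["02134", "90210", "9021O", "02134-1001"])
for row in rows:
    print(row)
```

The value "9021O" (letter O instead of zero) shows up as its own pattern, which is the kind of anomaly the report is meant to surface.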
Investigate default column names for the Token Report (Word investigate):
1. QsInvCount
2. QsInvWord
3. QsInvClassCode
Lab:
Character discrete: C mask (select one or many columns)
Character concatenate: C mask (select two or more columns to concatenate)
Word investigate: FullName
  Token Report
  Pattern Report
Word investigate: Address (pass address line 1, address line 2)
  Token Report
  Pattern Report
Word investigate: Area (city, state, zip)
  Token Report
  Pattern Report
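The difference between the token report and the pattern report can be sketched in Python as follows. This is a simplification under stated assumptions: the `CLASSIFICATION` table is hypothetical, and the class codes used here ("^" for a numeric token, "?" for an unclassified word) are simplified stand-ins; a real rule set has its own classification table and a richer set of codes.

```python
from collections import Counter

# Hypothetical classification table: token -> single-character class code.
CLASSIFICATION = {"ST": "T", "AVE": "T", "RD": "T", "N": "D", "S": "D"}


def classify(token: str) -> str:
    token = token.upper()
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if token.isdigit():
        return "^"  # assumed code for a numeric token
    return "?"      # assumed code for an unclassified token


def investigate(values):
    token_counts = Counter()    # token report: (word, class code) -> count
    pattern_counts = Counter()  # pattern report: pattern string -> count
    for v in values:
        tokens = v.split()
        codes = [classify(t) for t in tokens]
        token_counts.update((t.upper(), c) for t, c in zip(tokens, codes))
        pattern_counts["".join(codes)] += 1
    return token_counts, pattern_counts


tokens, patterns = investigate(["123 MAIN ST", "45 N OAK AVE", "9 ELM ST"])
print(tokens.most_common(3))
print(patterns.most_common())
```

The token report counts individual words with their class codes, while the pattern report counts one pattern per input value.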
1. Why investigate:
-> Discover trends and potential anomalies in the data
-> Identify invalid and default values in the data
-> Verify the reliability of the data in the fields to be used as matching criteria
-> Gain a complete understanding of the data in context
Investigate:
Verify the domain:
  Review each field and verify that the data matches the metadata
  Identify the data formats, missing values, and default values
Identify the data anomalies:
  Format
  Structure
  Content
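The checks above (data matches metadata, missing values, default values, format anomalies) can be sketched for a single field like this. The rule `ZIP_RULE`, its regular expression, and the list of default values are assumptions made up for the example, not something QualityStage defines.

```python
import re

# Hypothetical metadata for one field: expected format and known default values.
ZIP_RULE = {
    "pattern": re.compile(r"\d{5}(-\d{4})?"),
    "defaults": {"00000", "99999"},
}


def check_value(value, rule):
    """Classify one value as 'missing', 'default', 'format' or 'ok'."""
    if value is None or value.strip() == "":
        return "missing"
    v = value.strip()
    if v in rule["defaults"]:
        return "default"
    if not rule["pattern"].fullmatch(v):
        return "format"  # value does not match the expected format
    return "ok"


results = [check_value(v, ZIP_RULE) for v in ["02134", "", "99999", "2134A", None]]
print(results)  # ['ok', 'missing', 'default', 'format', 'missing']
```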
Features of investigate:
  Analyze free-form and single-domain columns
  Provide frequency distributions of distinct values and patterns
Investigate methods:
  Character discrete
  Character concatenate
  Word investigate
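One way to picture the difference between the two character methods: character discrete produces a separate frequency distribution for each selected column, while character concatenate joins the selected columns first and produces one distribution over the combined pattern. The sketch below assumes a T-style pattern for every character; the function names and the sample records are hypothetical.

```python
from collections import Counter


def type_pattern(value: str) -> str:
    # Letters become 'a', digits become 'n', anything else is kept.
    return "".join("a" if c.isalpha() else "n" if c.isdigit() else c for c in value)


def discrete(records, columns):
    # One frequency table per column.
    return {c: Counter(type_pattern(r[c]) for r in records) for c in columns}


def concatenate(records, columns):
    # One frequency table over the joined patterns of all columns.
    return Counter("".join(type_pattern(r[c]) for c in columns) for r in records)


records = [
    {"State": "MA", "Zip": "02134"},
    {"State": "CA", "Zip": "90210"},
    {"State": "M1", "Zip": "02134"},
]

print(discrete(records, ["State", "Zip"]))
print(concatenate(records, ["State", "Zip"]))
```

The discrete result shows that one State value has an odd pattern; the concatenated result shows which combination of State and Zip patterns it occurs in.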
2. Standardize stage:
1. Country identifier:
---> Select the COUNTRY rule set from Others
---> Pass the literal ZQUSZQ and add the columns addressline1, addressline2, city, state, zip
---> Filter the records on the flag: wherever the flag is 'Y', those are US records
---> Split the US and non-US records into separate targets
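The filter-and-split step after the country identifier can be sketched as below. In the job this is done with stages rather than code; the field name `flag` and the sample records are hypothetical, and the rule "flag 'Y' means US record" is taken from the notes above.

```python
def split_records(records, flag_field="flag"):
    """Split records into US and non-US lists based on a Y/N flag."""
    us = [r for r in records if r.get(flag_field) == "Y"]       # per the notes: 'Y' marks a US record
    non_us = [r for r in records if r.get(flag_field) != "Y"]   # everything else goes to the other target
    return us, non_us


# Hypothetical output of the country identifier step.
records = [
    {"city": "BOSTON", "flag": "Y"},
    {"city": "TORONTO", "flag": "N"},
    {"city": "AUSTIN", "flag": "Y"},
]

us_records, non_us_records = split_records(records)
print(len(us_records), len(non_us_records))  # 2 1
```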
2. Apply the USPREP rule set to separate name components from address fields, and area components from address fields
-> Select the USPREP rule set from the Standardize rules
-> Pass ZQNAMEZQ and add the column "Fullname"
-> Pass ZQADDRZQ and add the column "addressline1"
-> Pass ZQADDRZQ and add the column "addressline2"
-> Pass ZQAREAZQ and add the column "City"
-> Pass ZQAREAZQ and add the column "State"
-> Pass ZQAREAZQ and add the column "Zip"
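To show how the literals and columns interleave, the sketch below builds the string in the order listed above and then routes each token to a domain based on the most recent literal. This only models the delimiter-driven part; it is an assumption-laden simplification, since the real rule set can also move tokens between domains based on their content. `build_input`, `route`, and the sample record are hypothetical.

```python
# Literal -> output domain column, using the names from these notes.
LITERALS = {
    "ZQNAMEZQ": "NameDomain_USPREP",
    "ZQADDRZQ": "AddressDomain_USPREP",
    "ZQAREAZQ": "AreaDomain_USPREP",
}


def build_input(record):
    # Same order as the literal/column pairs listed in the notes.
    parts = [
        "ZQNAMEZQ", record["Fullname"],
        "ZQADDRZQ", record["addressline1"],
        "ZQADDRZQ", record["addressline2"],
        "ZQAREAZQ", record["City"],
        "ZQAREAZQ", record["State"],
        "ZQAREAZQ", record["Zip"],
    ]
    return " ".join(p for p in parts if p)  # drop empty column values


def route(text):
    domains = {name: [] for name in LITERALS.values()}
    current = None
    for token in text.split():
        if token in LITERALS:
            current = LITERALS[token]       # a literal switches the target domain
        elif current is not None:
            domains[current].append(token)  # data token goes to the current domain
    return {name: " ".join(tokens) for name, tokens in domains.items()}


record = {
    "Fullname": "JOHN SMITH",
    "addressline1": "123 MAIN ST",
    "addressline2": "APT 4",
    "City": "BOSTON",
    "State": "MA",
    "Zip": "02134",
}
text = build_input(record)
print(text)
print(route(text))
```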
Standardize USNAME, USADDR, USAREA
1. Select the USNAME rule set from the Standardize rules and add the column NameDomain_USPREP
2. Select New Process, select the USADDR rule set, and add the column AddressDomain_USPREP
3. Select New Process, select the USAREA rule set, and add the column AreaDomain_USPREP
Rules         Columns
USNAME.SET    NameDomain_USPREP
USADDR.SET    AddressDomain_USPREP
USAREA.SET    AreaDomain_USPREP
Investigate unhandled patterns
Take the above job as input and use 3 Investigate stages:
1. InvUnhandledName
2. InvUnhandledAddr
3. InvUnhandledArea
InvUnhandledName:
Select the Character concatenate method for Name
Select the columns:
UnhandledPattern_USNAME ---> set C mask
UnhandledData_USNAME ---> set X mask
InputPattern_USNAME ---> set X mask
NameDomain_USPREP ---> set X mask
InvUnhandledAddr:
Select the Character concatenate method for Address
Select the columns:
UnhandledPattern_USADDR ---> set C mask
UnhandledData_USADDR ---> set X mask
InputPattern_USADDR ---> set X mask
AddressDomain_USPREP ---> set X mask
InvUnhandledArea:
Select the Character concatenate method for Area
Select the columns:
UnhandledPattern_USAREA ---> set C mask
UnhandledData_USAREA ---> set X mask
InputPattern_USAREA ---> set X mask
AreaDomain_USPREP ---> set X mask
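The idea behind the C and X masks in these three stages, as I understand it, is that only the unhandled pattern column drives the frequency count, while the X-masked columns are left out of the pattern and serve only as sample data for each pattern. A minimal sketch of that grouping, shown for the name domain: the function `unhandled_report`, the sample records, and the pattern strings are made up, and skipping records with an empty unhandled pattern is a choice of this sketch.

```python
from collections import Counter


def unhandled_report(records, pattern_col, sample_cols):
    counts = Counter()
    samples = {}
    for r in records:
        p = r[pattern_col]
        if not p:  # sketch choice: ignore records with no unhandled pattern
            continue
        counts[p] += 1  # only the C-masked column is counted
        # X-masked columns are kept only as a sample for the pattern.
        samples.setdefault(p, {c: r[c] for c in sample_cols})
    return [(p, n, samples[p]) for p, n in counts.most_common()]


# Hypothetical standardized output for the name domain.
records = [
    {"UnhandledPattern_USNAME": "++^", "UnhandledData_USNAME": "JOHN SMITH 3",
     "InputPattern_USNAME": "++^", "NameDomain_USPREP": "JOHN SMITH 3"},
    {"UnhandledPattern_USNAME": "", "UnhandledData_USNAME": "",
     "InputPattern_USNAME": "++", "NameDomain_USPREP": "ANN LEE"},
    {"UnhandledPattern_USNAME": "++^", "UnhandledData_USNAME": "MARY JONES 2",
     "InputPattern_USNAME": "++^", "NameDomain_USPREP": "MARY JONES 2"},
    {"UnhandledPattern_USNAME": "^", "UnhandledData_USNAME": "7",
     "InputPattern_USNAME": "^", "NameDomain_USPREP": "7"},
]

report = unhandled_report(
    records,
    "UnhandledPattern_USNAME",
    ["UnhandledData_USNAME", "InputPattern_USNAME", "NameDomain_USPREP"],
)
for pattern, count, sample in report:
    print(pattern, count, sample)
```

Sorting by count puts the most frequent unhandled patterns first, which is where a rule set change would pay off most.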