QualityStage Functionality in DataStage
Four tasks are performed by QualityStage in DataStage: standardization, investigation, survivorship and matching. We will look at each of these in turn. Under the covers, QualityStage incorporates a set of probabilistic matching algorithms that can find potential duplicates in data despite variations in spelling, numeric or date values, use of non-standard forms, and various other obstacles that defeat deterministic methods. For example, if you have what appears to be the same employee record, where the name is the same but the date of hire differs by a day or two, a deterministic algorithm would report two different employees, whereas a probabilistic algorithm would flag the potential duplicate.
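To make the deterministic-versus-probabilistic distinction concrete, here is a minimal Python sketch (not QualityStage's actual algorithm) that compares two hypothetical employee records both ways; the field names and the 0.85 threshold are illustrative assumptions.

```python
# A minimal sketch (not QualityStage's actual algorithm) contrasting a
# deterministic comparison with a simple probabilistic-style score.
# The record fields and the 0.85 threshold are hypothetical.
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    # Exact equality on every field: one differing digit means "no match".
    return a == b

def probabilistic_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Average per-field string similarity; tolerates small variations such as
    # a hire date that differs by a day or a slightly different spelling.
    scores = [SequenceMatcher(None, str(a[k]), str(b[k])).ratio()
              for k in a.keys() & b.keys()]
    return sum(scores) / len(scores) >= threshold

emp1 = {"name": "John Smith", "hire_date": "2008-03-01"}
emp2 = {"name": "John Smith", "hire_date": "2008-03-02"}

print(deterministic_match(emp1, emp2))   # False: treated as two employees
print(probabilistic_match(emp1, emp2))   # True: flagged as a potential duplicate
```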
Investigation Stage in QualityStage:
By investigation we mean inspecting the data to reveal certain types of information about it. There is some overlap between QualityStage investigation and the kinds of profiling results that are available from Information Analyzer, but not so much overlap as to suggest removing functionality from either tool. QualityStage can undertake three different kinds of investigation.
Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only codes A through E. A character discrete investigation looking at the values in that field will report the number of occurrences of every value in the field (and therefore any out-of-range values, empty or null values, and so on). "Pattern" in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example, a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the same field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns; the developer determines which.
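As a rough illustration of what such a pattern report produces, here is a minimal Python sketch. It assumes a common convention of 'a' for alphabetic, 'n' for numeric and 'b' for blank, with other characters shown literally; the phone-number values are made up.

```python
# A minimal sketch of a character discrete investigation producing a
# frequency distribution of patterns. The character-class convention
# ('a'/'n'/'b') and the sample values are assumptions for illustration.
from collections import Counter

def char_pattern(value: str) -> str:
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)          # punctuation etc. kept as-is
    return "".join(out)

phone_numbers = ["(02) 9555 1234", "02-9555-1234", "0295551234"]

# Frequency distribution of patterns - the raw material for cleansing rules.
print(Counter(char_pattern(v) for v in phone_numbers))
# -> each pattern occurs once: '(nn)bnnnnbnnnn', 'nn-nnnn-nnnn', 'nnnnnnnnnn'
```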
Character concatenate investigation is exactly the same as character discrete
investigation except that the contents of more than one field can be examined as if they
were in a single field – the fields are, in some sense, concatenated prior to the
investigation taking place. The results of a character concatenate investigation can be
useful in revealing whether particular sets of patterns or values occur together.
Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected "rule set"; the other is to report how those words are to be classified, again in terms of the currently selected "rule set". Word investigation has no overlap with Information Analyzer (the data profiling tool).
A rule set includes a set of tables that list the “known” words or tokens. For example,
the GBNAME rule set contains a list of names that are known to be first names in Great
Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the
GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that
can not only be recognized as name prefixes (titles, if you prefer) but can in some cases
reveal additional information, such as gender.
When a word investigation reports on classification, it does so by producing a
pattern. This shows how each known word in the data record is classified, and the order
in which each occurs. For example, under the USNAME rule set the name WILLIAM F.
GAINES III would report the pattern FI?G – the F indicates that “William” is a known first
name, the I indicates the “F” is an initial, the ? indicates that “Gaines” is not a known
word in context, and the G indicates that “III” is a “generation” – as would be “Senior”,
“IV” and “fils”. Punctuation may be included or ignored.
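A toy Python sketch of how such a pattern could be derived follows; the tiny classification table and the single-letter-equals-initial rule are hypothetical stand-ins for the real USNAME rule set.

```python
# A minimal sketch of deriving a word-investigation pattern such as FI?G.
# The classification table below is a hypothetical fragment, not the real
# USNAME rule set format.
CLASSIFICATION = {
    "WILLIAM": "F",   # known first name
    "III": "G",       # generation
    "SENIOR": "G",
    "IV": "G",
}

def classify(token: str) -> str:
    token = token.strip(".,")
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if len(token) == 1:
        return "I"        # single letter treated as an initial
    return "?"            # unknown word in this context

def name_pattern(name: str) -> str:
    return "".join(classify(t) for t in name.upper().split())

print(name_pattern("WILLIAM F. GAINES III"))   # FI?G
```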
Rule sets also come into play when performing standardization (discussed below).
Classification tables contain not only the words/tokens that are known and classified,
but also contain the standard form of each (for example “William” might be recorded as
the standard form for “Bill”) and may contain an uncertainty threshold (for example
“Felliciity” might still be recognizable as “Felicity” even though it is misspelled in the
original data record). Probabilistic matching is one of the significant strengths of
QualityStage.
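The idea of a classification table carrying a standard form and an uncertainty threshold can be sketched roughly as follows; the entries, thresholds and the use of a generic string-similarity ratio are assumptions for illustration, not the actual QualityStage mechanics.

```python
# A minimal sketch of a classification table whose entries carry a class,
# a standard form and a minimum accepted similarity, so that a slightly
# misspelled token can still be recognized. All values are hypothetical.
from difflib import SequenceMatcher

# token -> (class, standard form, minimum similarity accepted)
CLASS_TABLE = {
    "WILLIAM":  ("F", "WILLIAM",  0.85),
    "BILL":     ("F", "WILLIAM",  0.90),
    "FELICITY": ("F", "FELICITY", 0.80),
}

def lookup(token: str):
    token = token.upper()
    best = None
    for known, (cls, standard, threshold) in CLASS_TABLE.items():
        score = SequenceMatcher(None, token, known).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (cls, standard, score)
    return best

print(lookup("Bill"))        # ('F', 'WILLIAM', 1.0)
print(lookup("Felliciity"))  # still resolves to 'FELICITY' despite the misspelling
```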
Investigation might also be performed to review the results of standardization, particularly to see whether there are any unhandled patterns or text that could be better handled if the rule set itself were tweaked, either with improved classification tables or through a mechanism called rule set overrides. That covers the theory behind the Investigation stage.
Standardization Stage in QualityStage:
Standardization, as the name suggests, is the process of generating standard forms of data that can more reliably be matched. For example, by generating the standard form "William" from "Bill", there is an increased likelihood of finding the match between "William Gates" and "Bill Gates". Other standard forms that can be generated
between “William Gates” and “Bill Gates”. Other standard forms that can be generated
include phonetic equivalents (using NYSIIS and/or Soundex), and something like
“initials” – maybe the first two characters from each of five fields.
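For the phonetic equivalents mentioned above, here is a hand-rolled Soundex sketch (QualityStage also offers NYSIIS); this is only an illustration of the idea, not the product's implementation.

```python
# A minimal sketch of generating phonetic match keys with classic Soundex,
# illustrating why "Smith" and "Smyth" standardize to the same key.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(word: str) -> str:
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    first, prev = word[0], SOUNDEX_CODES.get(word[0], "")
    digits = []
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]

# Variant spellings collapse to the same phonetic key.
print(soundex("Smith"), soundex("Smyth"))   # S530 S530
```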
Each standardization specifies a particular rule set. As well as word/token classification
tables, a rule set includes specification of the format of an output record structure, into
which original and standardized forms of the data, generated fields (such as gender) and
reporting fields (for example whether a user override was used and, if so, what kind of
override) may be written.
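As a rough picture of such an output record structure, the following sketch uses hypothetical field names; real rule sets define their own output columns.

```python
# A hypothetical shape for a standardized-name output record: original data,
# standardized forms, a generated field and reporting fields. Field names are
# illustrative, not the real rule set output columns.
from dataclasses import dataclass

@dataclass
class StandardizedName:
    original: str            # the input as received
    first_name_std: str      # standard form, e.g. "WILLIAM" for "Bill"
    last_name_std: str
    gender: str              # generated from a recognized prefix such as "Mr"
    unhandled_pattern: str   # reporting: tokens the rule set could not classify
    override_used: bool      # reporting: whether a user override fired

record = StandardizedName(
    original="Mr Bill Gates",
    first_name_std="WILLIAM",
    last_name_std="GATES",
    gender="M",
    unhandled_pattern="",
    override_used=False,
)
print(record)
```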
Standardization may be the desired end result of using QualityStage. For example, street address components such as "Street", "Avenue" or "Road" are often represented differently in data, perhaps abbreviated differently in different records. Standardization can convert all the non-standard forms into whatever standard format the organization has decided it will use. This kind of QualityStage 8.1 job can be set up as a web service: a data entry application might send in an address to be standardized, and the web service would return the standardized address to the caller.
More commonly, standardization is a preliminary step towards performing matching. More accurate matching can be performed if standard forms of words or tokens are compared rather than the original forms of the data.
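Returning to the web-service scenario above, a data entry application could call the deployed standardization job over HTTP along these lines; the endpoint URL and payload shape here are entirely hypothetical.

```python
# A minimal sketch of calling a standardization web service. The URL and the
# JSON payload/response shape are invented; the real interface is whatever
# the deployed QualityStage job exposes.
import json
from urllib import request

def standardize_address(raw_address: str) -> dict:
    payload = json.dumps({"address": raw_address}).encode("utf-8")
    req = request.Request(
        "http://qs-server:9080/standardize/address",   # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# standardize_address("42 Main Str.")  ->  {"address": "42 MAIN ST"}  (illustrative)
```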
Matching Stage in QualityStage:
Matching is the real heart of QualityStage 8.1. Different probabilistic algorithms are available for different types of data. Using the frequencies developed during investigation (or subsequently), the information content (or "rarity value") of each value in each field can be estimated; the less common a value, the more information it contributes to the decision. A separate agreement weight or disagreement weight is calculated for each field in each data record, incorporating both its information content (the likelihood that a match actually has been found) and the probability that a match has been found purely at random. These weights are summed across the fields in the record to come up with an aggregate weight that can be used as the basis for reporting that a particular pair of records probably are, or probably are not, duplicates of each other. There is a third possibility, a "grey area" in the middle, which QualityStage refers to as the "clerical review" area; record pairs in this category need to be referred to a human to make the decision because there is not enough certainty either way. Over time the algorithms can be tuned with improved rule sets, weight overrides, different probability settings and so on, so that fewer and fewer "clericals" are found.
Matching makes use of a concept called "blocking", an unfortunately chosen term meaning that potential duplicates form blocks (or groups, or sets) which can be treated as separate sets of potentially duplicated values. Each block of potential duplicates is given a unique ID, which can be used by the next phase (survivorship) and can also be used to set up a table of linkages between the blocks of potential duplicates and the keys of the original data records in those blocks. This is often a requirement when de-duplication is being performed, for example when combining records from multiple sources or generating a list of unique addresses from a customer file.
More than one pass through the data may be required to identify all the potential duplicates. For example, one customer record may refer to a customer by street address while another record for the same customer uses the customer's post office box address. Searching for duplicate addresses would not find this customer; an additional pass based on some other criteria would also be required. QualityStage provides for multiple passes, either fully passing through the data on each pass, or only examining the unmatched records on subsequent passes (which is usually faster).
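The agreement/disagreement weights and the match/clerical/non-match cutoffs can be sketched with a simple Fellegi-Sunter style calculation, the classical formulation of probabilistic record linkage; the m and u probabilities and cutoff values below are invented for illustration and are not QualityStage's defaults.

```python
# A minimal Fellegi-Sunter style sketch of field weights and composite
# scoring. The m/u probabilities and the cutoffs are hypothetical.
import math

def field_weights(m: float, u: float):
    # m: probability the field agrees when the pair really is a match
    # u: probability the field agrees purely at random (non-match)
    agreement = math.log2(m / u)
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement

# Rarer values carry more information: a surname that agrees at random only
# 0.1% of the time contributes a larger agreement weight than a gender field.
print(field_weights(m=0.95, u=0.001))   # surname -> (about  9.89, -4.32)
print(field_weights(m=0.98, u=0.5))     # gender  -> (about  0.97, -4.64)

def composite_weight(pairs):
    # Sum the agreement weight when the field agrees, disagreement otherwise.
    total = 0.0
    for agrees, m, u in pairs:
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total

MATCH_CUTOFF, CLERICAL_CUTOFF = 8.0, 3.0     # hypothetical thresholds

w = composite_weight([(True, 0.95, 0.001), (False, 0.98, 0.5)])
if w >= MATCH_CUTOFF:
    print(w, "-> match")
elif w >= CLERICAL_CUTOFF:
    print(w, "-> clerical review")   # this example lands in the grey area
else:
    print(w, "-> non-match")
```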