Friday, March 30, 2012

itzik ben-gan

I am using the following query which I found from a fragmanet of code
by itzik ben-gan to assign a common group id for group of records in my
case which have similar SSN and first Name and Last Name. if the SSN is
the same it should also check the first name and last name of the
record. Becuase records have more than three AKA names, I need to check
all the possibilities of first name last name combination to verify the
records are the same.
This code works fine and can assign group numbers for all the rows.
I am trying this code on a database of 65,000 rows. It's taking around
20 minute to complete. but I'll have to run the same code on
800,000,000 rows.
It will take years to finish.
even if the query is optimized to run in 1 second for 65,000 rows, it
will take more than 4 hours to run on the 800,000,000 row. This where I
realized I am in the "wrong jungle".
1. is there any other feasible and faster way to do this? very
important issue.
2. while assigning group number, it doesn't give sequential numbers. It
skips some of the numbers( group Number 1,2 5,9...) just curiouse(
not very important)
SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS grp_num
into tmp_FLS
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)
GO
distinct firstname, lastname and SSN table from the tu_people_data.
I created this table to increase the query performance.
CREATE TABLE [distFLS] (
[fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[ssn] [int] NULL
) ON [PRIMARY]
GO
CREATE TABLE [TU_People_Data] (
[tu_id] [bigint] NOT NULL ,
[count_id] [int] NOT NULL ,
[fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[ssn] [int] NULL ,
CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
(
[tu_id],
[count_id]
) ON [PRIMARY]
) ON [PRIMARY]
GO
sample data
there is a column count_id after the tu_id and before fname(tu_id and
count_id are primary keys)
tu_id fname lname SSN
156078480 KRISINA WALSH 999999000
156078480 KRISTINA GIERER 999999000
156078480 KRISTINA WALSH 999999000
151257883 J SOTO 999999111
151257883 JOSE LARIOS 999999111
151257883 JOSE SOTO 999999111
151257883 L SOTO 999999111
136312525 ELADIO GARCIA 999999222
136312525 ELADIO NAVA 999999222
136312525 ELADIO NAVAGARCIA 999999222
136312525 GARCIA NAVA 999999222
149180940 DARREN SAUERWINE 999999333
149180940 DARREN SUAERWIN 999999333First, examine the 65,000 T-SQL in Query Analyzer using the Show Execution
Plan feature and confirm that the select portion of the query is using
efficient index ss.
http://msdn.microsoft.com/library/d... />
1_5pde.asp
Also, when inserting, updating, or deleting a massive amount of data (ex > 1
million rows), you start running into series issues with the CPU, I/O, and
disk storage consumed by transaction logging. I assure you that attempting
to update 800 million rows in a single batch will take much longer than 4
hours, regardless of your server configuration. Do a goole search of the
*sqlserver* newsgroups using the keywords "transaction log" "million" and
"hours". There are techniques for minimizing transaction logging and
performing the updates or inserts in batches using a looping method. You
will also need to coordinate with the network admin and allocate the storage
space on the SAN ahead of time.
http://groups.google.com/groups? as...n
um=100
Once done, there are also the issues of logical index fragmentation and
physical extent fragmentation, etc.
Don't fire this up on a Friday evening and expect it to run over the
wend; it could be a w long project at best.
<jacob.dba@.gmail.com> wrote in message
news:1141687843.303092.247700@.i39g2000cwa.googlegroups.com...
>I am using the following query which I found from a fragmanet of code
> by itzik ben-gan to assign a common group id for group of records in my
> case which have similar SSN and first Name and Last Name. if the SSN is
> the same it should also check the first name and last name of the
> record. Becuase records have more than three AKA names, I need to check
> all the possibilities of first name last name combination to verify the
> records are the same.
> This code works fine and can assign group numbers for all the rows.
> I am trying this code on a database of 65,000 rows. It's taking around
> 20 minute to complete. but I'll have to run the same code on
> 800,000,000 rows.
> It will take years to finish.
> even if the query is optimized to run in 1 second for 65,000 rows, it
> will take more than 4 hours to run on the 800,000,000 row. This where I
> realized I am in the "wrong jungle".
> 1. is there any other feasible and faster way to do this? very
> important issue.
> 2. while assigning group number, it doesn't give sequential numbers. It
> skips some of the numbers( group Number 1,2 5,9...) just curiouse(
> not very important)
> SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS grp_num
> into tmp_FLS
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
> GO
> distinct firstname, lastname and SSN table from the tu_people_data.
> I created this table to increase the query performance.
> CREATE TABLE [distFLS] (
> [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [ssn] [int] NULL
> ) ON [PRIMARY]
> GO
>
> CREATE TABLE [TU_People_Data] (
> [tu_id] [bigint] NOT NULL ,
> [count_id] [int] NOT NULL ,
> [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [ssn] [int] NULL ,
> CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
> (
> [tu_id],
> [count_id]
> ) ON [PRIMARY]
> ) ON [PRIMARY]
> GO
>
> sample data
> there is a column count_id after the tu_id and before fname(tu_id and
> count_id are primary keys)
> tu_id fname lname SSN
> 156078480 KRISINA WALSH 999999000
> 156078480 KRISTINA GIERER 999999000
> 156078480 KRISTINA WALSH 999999000
> 151257883 J SOTO 999999111
> 151257883 JOSE LARIOS 999999111
> 151257883 JOSE SOTO 999999111
> 151257883 L SOTO 999999111
> 136312525 ELADIO GARCIA 999999222
> 136312525 ELADIO NAVA 999999222
> 136312525 ELADIO NAVAGARCIA 999999222
> 136312525 GARCIA NAVA 999999222
> 149180940 DARREN SAUERWINE 999999333
> 149180940 DARREN SUAERWIN 999999333
>

No comments:

Post a Comment