Finding candidate keys/Primary keys from data that needs to be imported

This is a continuation from the previous post where we went and inserted data from multiple CSV files into SQL server tables using linked server connection. Often the next step is to identify how the tables are related to each other. While there is no guarantee that the relationships defined here are 100% accurate it will help identify the obvious ones and make the job easier.

First we get a count of rows in the table, we can use this to compare with counts from distinct values of the columns later. The counts are stored in a table so that we can perform other checks later if needed.

CREATE TABLE [dbo].[TblRowCounts] (
    [id] [int] IDENTITY(1, 1) NOT NULL
    ,[tabname] [varchar](1000) NULL
    ,[counts] [int] NULL
    ) ON [PRIMARY]
GO

INSERT INTO TblRowCounts  (
    tabname
    ,counts
    )
EXEC sp_msforeachtable 'select "?",count(*) from ?';

Now that we have the table row counts, we get the distinct Column values counts for each table. If the column is a candidate we should see the numbers match. First we create a table to store the counts for each column

CREATE TABLE [dbo].[ColRowCounts] (
    [id] [int] IDENTITY(1, 1) NOT NULL
    ,[tabname] [varchar](1000) NULL
    ,[colname] [varchar](100) NULL
    ,[counts] [int] NULL
    ) ON [PRIMARY]

Next we calculate the columns distinct values count by using a cursor.

DECLARE @tabname VARCHAR(100)
DECLARE @colname VARCHAR(100)

DECLARE loopcol CURSOR
FOR
SELECT TABLE_NAME
    ,COLUMN_NAME
FROM INFORMATION_SCHEMA.cOLUMNS
WHERE DATA_TYPE NOT LIKE '%text%'

OPEN loopcol

FETCH NEXT
FROM loopcol
INTO @tabname
    ,@colName

WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @sql NVARCHAR(4000) = ' insert into ColRowCounts (tabname , colname , counts) select' + + CHAR(39) + @tabname + CHAR(39) + + ' ,' + CHAR(39) + @colname + CHAR(39) + ' , count(distinct ([' + @colname + '])) from ' + @tabname + ''

    EXEC (@sql)

    FETCH NEXT
    FROM loopcol
    INTO @tabname
        ,@colname
END

CLOSE loopcol

DEALLOCATE loopcol

With the row counts inserted into the tables we can now compare the numbers to see columns which have distinct counts the same as the total number of rows in the table , in other words columns with unique values.

SELECT c.tabname
    ,c.colname
FROM [dbo].[TblRowCounts] t
INNER JOIN ColRowCounts c ON t.tabname = '[dbo].[' + c.tabname + ']'
    AND t.counts = c.counts

A slightly modified version of this script allows me to script out the alter table command that will convert the above columns to NOT NULLABLE as should be the case based on the numbers we got from the previous scripts.

WITH cte
AS (
    SELECT c.tabname
        ,c.colname
    FROM [dbo].[TblRowCounts] t
    INNER JOIN ColRowCounts c ON t.tabname = '[dbo].[' + c.tabname + ']'
        AND t.counts = c.counts
    )
SELECT 'ALTER TABLE ' + t.tabname + ' ALTER COLUMN [' + colname + '] ' + DATA_TYPE + CASE 
        WHEN CHARACTER_MAXIMUM_LENGTH IS NOT NULL
            THEN + '(' + cast(CHARACTER_MAXIMUM_LENGTH AS VARCHAR(100)) + ')'
        ELSE ''
        END + ' NOT NULL'
FROM cte t
INNER JOIN INFORMATION_SCHEMA.COLUMNS c ON t.tabname = c.TABLE_NAME
    AND t.colname = c.COLUMN_NAME

With the columns being not nullables they are now ready to have primary keys built on top of them. To build out the script I use the below code block to generate scripts but do not execute them without reviewing the column data first.

WITH cte
AS (
    SELECT c.tabname
        ,c.colname
    FROM [dbo].[TblRowCounts] t
    INNER JOIN ColRowCounts c ON t.tabname = '[dbo].[' + c.tabname + ']'
        AND t.counts = c.counts
    )
SELECT 'ALTER TABLE ' + t.tabname + ' ADD CONSTRAINT PK_' + replace(colname, ' ', '') + ' PRIMARY KEY CLUSTERED ([' + colname + '])'
    ,count(*)
FROM cte t
GROUP BY t.tabname
    ,colname
ORDER BY t.tabname

With the Primary keys created the next step obviously is to create the Foreign Keys,  in order to do this we rely on the column names being the same in both tables. If the Primary key column name can be found on any other table we assume the two columns are related and go ahead with creating the PK- FK Relationship.

DROP TABLE #fktables GO

SELECT row_number() OVER (
        ORDER BY c.table_name
        ) AS rid
    ,'Alter table ' + c.TABLE_NAME + ' Add constraint FK_' + c.TABLE_NAME + replace(c.COLUMN_NAME, ' ', '') + ' FOREIGN KEY ([' + c.COLUMN_NAME + ']) REFERENCES ' + k.TABLE_NAME + ' ([' + k.COLUMN_NAME + ']) ' AS def
INTO #fktables
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE k
INNER JOIN INFORMATION_SCHEMA.COLUMNS c ON c.COLUMN_NAME = k.COLUMN_NAME
    AND c.TABLE_NAME != k.TABLE_NAME
ORDER BY c.TABLE_NAME

DECLARE @cnt INT = 1

WHILE @cnt <= (
        SELECT max(rid)
        FROM #fktables
        )
BEGIN TRY
    DECLARE @sql NVARCHAR(2000) = (
            SELECT def
            FROM #fktables
            WHERE rid = @cnt
            )

    EXEC (@sql)

    SET @cnt = @cnt + 1
END TRY

BEGIN CATCH
    SET @cnt = @cnt + 1
END CATCH

With this we have taken raw data from CSV files imported them into SQL Server and then established meaningful relationships between previously unrelated data in disjoint tables. The rest of the work is pretty much standard for DB developers e.g creating SPs etc.

Please Consider Subscribing