Thursday, April 14, 2011

SELECT All that are not in another table

Hi All,

I've inherited a less-than-ideal table structure, and I'm trying to improve it as much as I can without tearing down and rebuilding. There's currently at least two levels of data for everything, the legacy data and the marketing override data. I'm trying to find all the records within the legacy data that don't yet have a marketing override.

So far, this is what I have:

SELECT DISTINCT 
  old.STYLE_NBR, old.COLOR_NBR 
FROM 
  LEGACY_PRODUCT_TABLE old
INNER JOIN 
  MARKETING_PRODUCT_TABLE new
ON old.STYLE_NBR <> new.style_number AND old.COLOR_NBR <> new.colour_number

This seems to work, but it takes a few minutes to run. If at all possible, I'd like a more efficient way of doing this.

Other info:

  • There are about 60,000 records in the legacy table
  • There are about 7,000 in the marketing table
  • Both STYLE_NBR and COLOR_NBR are char(5) and, when combined, make a unique ID.
From stackoverflow
  • Are the join fields indexed? That should speed things up considerably. Make sure old.STYLE_NBR, old.COLOR_NBR, new.style_number, and new.color_number are indexed.

  • I don't know if this will be faster, but it may be worth a try.

    SELECT DISTINCT 
      old.STYLE_NBR, old.COLOR_NBR 
    FROM 
      LEGACY_PRODUCT_TABLE old
    WHERE old.STYLE_NBR, old.COLOR_NBR 
    NOT IN 
    (
        SELECT old.STYLE_NBR, old.COLOR_NBR 
        FROM LEGACY_PRODUCT_TABLE old
        INNER JOIN 
            MARKETING_PRODUCT_TABLE new
            ON 
                old.STYLE_NBR == new.style_number AND old.COLOR_NBR == new.colour_number
    )
    
  • You should be using a LEFT OUTER JOIN and change your lookup

    SELECT DISTINCT 
      old.STYLE_NBR, old.COLOR_NBR 
    FROM 
      LEGACY_PRODUCT_TABLE old
      LEFT OUTER JOIN MARKETING_PRODUCT_TABLE new
        ON (old.STYLE_NBR + old.COLOR_NBR) = (new.style_number + new.Colour_number)
    WHERE (new.style_number + new.Colour_number) IS NULL
    
  • SELECT 
        old.* 
    FROM 
        LEGACY_PRODUCT_TABLE old 
    LEFT JOIN
        MARKETING_PRODUCT_TABLE new 
    ON 
        new.style_number=old.STYLE_NBR AND 
        new.colour_number=old.COLOR_NBR 
    WHERE 
        new.style_number IS NULL;
    

    Stands a better chance of using the indexes which you presumably have on the four columns in question.

  • What about NOT EXISTS?

    SELECT DISTINCT old.STYLE_NBR, old.COLOR_NBR 
    FROM LEGACY_PRODUCT_TABLE old
    WHERE NOT EXISTS
        (SELECT 1 FROM MARKETING_PRODUCT_TABLE new 
        WHERE old.STYLE_NBR = new.style_number 
          AND old.COLOR_NBR = new.colour_number)
    
  • Some options to try are:

    SELECT
        old.STYLE_NBR,
        old.COLOR_NBR
    FROM  
        LEGACY_PRODUCT_TABLE old
    LEFT OUTER JOIN
        MARKETING_PRODUCT_TABLE new
    ON
        old.STYLE_NBR = new.style_number
    AND
        old.COLOR_NBR = new.colour_number
    WHERE
        new.style_number IS NULL
    
    
    
    SELECT
        old.STYLE_NBR,
        old.COLOR_NBR
    FROM  
        LEGACY_PRODUCT_TABLE old
    WHERE
        NOT EXISTS
    (
        SELECT
         *
        FROM
         MARKETING_PRODUCT_TABLE new
        WHERE
         old.STYLE_NBR = new.style_number
        AND
         old.COLOR_NBR = new.colour_number
    )
    

    EDIT: The key thing with both of these is that you are joining using = rather than <>.

  • What you currently have is incorrect because it will return a row for every row that doesn't match, so potentially 6999 rows in the result per row in the legacy table if there is marketing override, or 7000 if there isn't. the distinct will then discard the duplicates, but the result will be wrong because even if there is a marketing matching row, the non-matching ones will make sure the result set will include the ones where there is no row.

    Try this instead:

    select distinct style_nbr, color_nbr
     from legacy_product_table L
    where not exists
    (
       select * from marketing_product_table m
       where m.style_nbr = L.style_nbr and m.color_nbr = L.color_nbr
    )
    

    Make sure the product table has an index on (style_nbr,color_nbr).

    HLGEM : I prefer the left join method, but I like the way you explained why the orginal was incorrect.
  • -- What about EXCEPT? (if this is SQL Server 2005 or 2008) select old.Style_NBR, Old.Color_NBR except select new.Style_NBR, new.Color_NBR

    -- try the code below in mssql 2008

    declare @Old table( Color_Nbr tinyint, Style_Nbr tinyint )

    declare @New table ( Color_Nbr tinyint, Style_Nbr tinyint )

    insert into @Old values (1,1), (2,2), (3,3), (4,4)

    insert into @New values (1,1), (2,2), (3,3), (5,5)

    select o.Color_Nbr, o.Style_Nbr from @Old o

    except

    select n.Color_Nbr, n.Style_Nbr from @New n

0 comments:

Post a Comment