Finding Duplicate Documents in SharePoint using PowerShell

This script looks through all documents stored in a SharePoint site collection and finds duplicate files based on document contents rather than document names. It was written for SharePoint 2010, but should find duplicate documents in SharePoint 2007 as well with very little modification.

Over time, it is quite possible the same document will be uploaded to numerous SharePoint libraries. Keeping track of duplicate content spread across multiple libraries can be practically impossible.

Building on the article here http://blog.codeassassin.com/2007/10/13/find-duplicate-files-with-powershell/ that details duplicate checking on fileshares, the following PowerShell script scans all the document libraries within a site collection for duplicate content by calculating an MD5 hash of each file's contents. The script groups identical hashes and produces a list of all duplicated files, detailing the full URL to the item and the file name.
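The hash-and-group technique itself is not SharePoint-specific. As a minimal illustration, here is a Python sketch that applies the same idea to a local folder, similar to the fileshare scenario in the linked article (the function name and structure are my own, not from the original script):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by the MD5 hash of their contents
    and return only the hashes shared by more than one file."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    # A hash with two or more paths means identical file contents
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

MD5 is adequate for spotting duplicates like this, since we only need identical files to hash identically; it is not suitable where an attacker could craft collisions.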

To run the script, copy the contents into Notepad and save it as a .ps1 file on one of your SharePoint servers. Then launch a PowerShell console and run the .ps1 file.

[Screenshot: PowerShell console output showing the duplicate files found]

The function returns the full path of all duplicated content.

This information could be piped back into SharePoint, or exported to Excel for analysis. You could even set this as a recurring job. In a future article, I will package this functionality into a timer job feature.

At present, the script stores all results in memory while it runs. If you are running this over a large site collection, that may not scale well, so it may be worth streaming the results into a SQL table or similar. Also, the script currently only evaluates content on a site collection basis, but it could be scoped to a web application or a whole farm if required.
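To show what streaming into a SQL table might look like, here is a Python/SQLite sketch (the table schema and function names are hypothetical): each hash is inserted as soon as it is computed rather than accumulated in memory, and the grouping is pushed down to SQL.

```python
import hashlib
import sqlite3

def open_store(path=":memory:"):
    """Create the results table; an on-disk path persists across runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (content_hash TEXT, full_path TEXT)")
    return conn

def record_file(conn, full_path, content):
    """Insert one file's hash as it is computed, instead of
    holding every record in memory for the whole run."""
    digest = hashlib.md5(content).hexdigest()
    conn.execute(
        "INSERT INTO files (content_hash, full_path) VALUES (?, ?)",
        (digest, full_path))

def duplicate_hashes(conn):
    """Let the database do the grouping the script does in memory:
    return (hash, copy count) for hashes seen more than once."""
    return conn.execute(
        "SELECT content_hash, COUNT(*) FROM files "
        "GROUP BY content_hash HAVING COUNT(*) > 1").fetchall()
```

With an on-disk database, a second query joining back to the files table would recover the full paths of each duplicate group, much as the PowerShell script does with Group-Object.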

Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

function Get-DuplicateFiles($RootSiteUrl)
{
    $spSite = Get-SPSite -Identity $RootSiteUrl
    $Items = @()
    $duplicateFiles = @()

    foreach ($spWeb in $spSite.AllWebs)
    {
        Write-Host "Checking" $spWeb.Title "for duplicate documents"

        foreach ($list in $spWeb.Lists)
        {
            # Skip hidden/system libraries and the SitePages library
            if ($list.BaseType -eq "DocumentLibrary" -and $list.RootFolder.Url -notlike "_*" -and $list.RootFolder.Url -notlike "SitePages*")
            {
                foreach ($item in $list.Items)
                {
                    if ($item.File.Length -gt 0)
                    {
                        $fileArray = $item.File.OpenBinary()
                        $hash = Get-MD5 $fileArray

                        $record = New-Object -TypeName System.Object
                        $record | Add-Member NoteProperty ContentHash ($hash)
                        $record | Add-Member NoteProperty FileName ($item.File.Name)
                        $record | Add-Member NoteProperty FullPath ($spWeb.Url + "/" + $item.Url)
                        $Items += $record
                    }
                }
            }
        }
        $spWeb.Dispose()
    }

    # Group identical hashes and keep only those occurring more than once
    $duplicateHashes = $Items | Group-Object ContentHash | Where-Object { $_.Count -gt 1 }

    foreach ($duplicateHash in $duplicateHashes)
    {
        $duplicateFiles += $Items | Where-Object { $_.ContentHash -eq $duplicateHash.Name }
        $duplicateFiles += "------------------------------------------------------------"
    }

    return $duplicateFiles | Format-Table FullPath
}

function Get-MD5($fileBytes = $(throw 'Usage: Get-MD5 [byte[]]'))
{
    # Hash the raw file contents and return the hash as a string
    $hashAlgorithm = New-Object System.Security.Cryptography.MD5CryptoServiceProvider
    $hashByteArray = $hashAlgorithm.ComputeHash($fileBytes)
    return [System.BitConverter]::ToString($hashByteArray)
}

Get-DuplicateFiles "<your sharepoint site>"
