Help searching a document

I hope some of you clever people can help me with my thesis.
It's a long document - will be about 80 000 words and, as you can imagine, has been through a lot of changes and restructurings along the way.

So I am now afraid that I may have ended up using exactly the same sentence/paragraph more than once in different places, and I need a way to check for that. Of course I don't know what the sentence would be, so a normal edit/find will not work.

So, what I need is a way to search for any string of, say, 5 or more words that is repeated in the document.
Is there a way to do that???

Would be VERY grateful for any help, as this would be a massive task to try to check manually...

Comments

  • Mr_Toad
    Mr_Toad Posts: 2,462 Forumite
    Might be possible but please tell me you intend to proof read it yourself!

    Computers aren't anywhere near as good as the Mk1 brain at this kind of thing.

    As part of the proof reading process you should be able to spot anything similar or repetitive, as well as gauging the overall structure of the thesis.
    One by one the penguins are slowly stealing my sanity.
  • slopemaster
    slopemaster Posts: 1,584 Forumite
    Part of the Furniture 1,000 Posts Combo Breaker
    Oh yes, of course!
    I have already started proof-reading, and it was the vague feeling of 'haven't I said this somewhere else?' that sparked the Q.

    The problem is tho', that it can be hard to know whether I moved it from another section, or duplicated it in another section...

    I also have a generous offer from another human (my OH) to proof-read, and he is good at spotting small mistakes which I miss because I read what "should" be there...

    But, it seemed to me that this is something a computer would actually do better than a human...
  • RockPaperScissors
    RockPaperScissors Posts: 359 Forumite
    Debt-free and Proud!
    edited 11 December 2012 at 4:10PM
    Hmmm...

    I agree; I think it would be better to proof it yourself. There's no easy way, though there are some programs (which I can't vouch for), such as this

    It's quite unlikely that you will have written the exact same sentence twice unless, as you say, you copied and pasted. But nine times out of ten you would likely have rewritten it slightly.

    It will be frustrating, but I think the best thing would be to work your way through proof reading, and every time you read something that feels a little "too" familiar, write it down separately and then forget about it. Then check over the compiled list of "déjà vu" later and see if there are any matches.

    Alternatively if something feels familiar, do the Ctrl + F on one of the key words in the sentence and see where else it appears.

    It will be time consuming but well worth it in the end! Congrats on writing it! :)
    DEBT FREE AT LAST!
    Virtual Sealed Pot Challenge 2014 - Member 161
    Single Pot 1 Total:£23.32
    Joint Account Pot Total:£6.67
  • slopemaster
    slopemaster Posts: 1,584 Forumite
    Part of the Furniture 1,000 Posts Combo Breaker
    Thanks, I'll have a look at that program
  • slopemaster
    slopemaster Posts: 1,584 Forumite
    Part of the Furniture 1,000 Posts Combo Breaker
    Well, it seems the problem is much more complex than I realised.

    That program only finds repeated LINES, so that won't work for me (as the repeated phrase or sentence might start at a different point on the line).

    A pretty exhaustive search has found lots of people asking the same Q as me, but no real answers!

    Still, if anyone does know, I'll be eternally grateful.
  • loudcox
    loudcox Posts: 179 Forumite
    You don't say what you're working in, but if it's Word, it should be fairly easy to accomplish - but computationally pretty darn expensive! I've thrown together a very basic bit of VBA to tokenise the text and search for matching chunks of a certain length. (It's not been checked

    The larger the document, and the smaller the CHUNK_SIZE below, the slower it will be! Current position is displayed on the application status bar. Alice in Wonderland (30K words) takes about a minute on my high spec laptop and finds nine repeats on a chunk size of 10.
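
As a rough feel for why document size matters so much: a scan that compares every starting word with every later starting word makes on the order of n²/2 comparisons for n words. The Python snippet below is only back-of-envelope arithmetic, not part of the macro. The name `pair_count` is made up for illustration, and it assumes run time grows in proportion to the number of pairs, which ignores the skip-ahead after a match.

```python
def pair_count(n_words: int) -> int:
    """Approximate number of (earlier, later) word-position pairs in a pairwise scan."""
    return n_words * (n_words - 1) // 2


alice = pair_count(30_000)   # document size quoted in the post
thesis = pair_count(80_000)  # thesis size from the original question
ratio = thesis / alice

print(f"{alice:,} pairs vs {thesis:,} pairs, ratio about {ratio:.1f}")
```

Under that assumption, an 80,000-word thesis would need roughly seven times as many comparisons as a 30,000-word book, so "about a minute" could plausibly turn into several minutes.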

    If you're not familiar with VBA, take a look at http://msdn.microsoft.com/en-us/library/office/ee814737(v=office.14).aspx
    1. TAKE A BACKUP OF YOUR DOCUMENT
    2. Open the VBA Editor
    3. Insert a new module
    4. Paste the code below
    5. Change CHUNK_SIZE to an appropriate number
    6. Run FindMatches
    7. It'll produce a new document listing the word number, phrase, and the word where the match starts - your original should be untouched.
    8. MAKE SURE YOU TAKE A BACKUP FIRST THOUGH!
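    ' Returns s with every character that is not a letter or digit removed
    Function GetAlpha(s As String) As String
        Dim sOut As String
        Dim i As Long

        sOut = ""
        For i = 1 To Len(s)
            ' No commas inside the brackets: Like would treat them as literal characters to keep
            If Mid(s, i, 1) Like "[A-Za-z0-9]" Then sOut = sOut & Mid(s, i, 1)
        Next i

        GetAlpha = sOut
    End Function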
    
    
    Public Sub FindMatches()

        ' Number of words in a repeated phrase
        Const CHUNK_SIZE As Integer = 5

        Dim aWords() As String
        Dim aClean() As String
        Dim strText As String
        Dim strSearch As String
        Dim strMatch As String
        Dim idxThis As Long
        Dim idxNext As Long
        Dim s As String
        Dim uB As Long
        Dim i As Long
        Dim d As Document
        Dim dInfo As Document
        Dim blnFound As Boolean

        ' ThisDocument is the document (or template) that contains this module
        Set d = ThisDocument
        Set dInfo = Application.Documents.Add

        ' Turn paragraph marks, manual line breaks and tabs into spaces so that
        ' words on either side of them are split apart rather than run together
        strText = Replace(d.Range.Text, vbCr, " ")
        strText = Replace(strText, Chr(11), " ")
        strText = Replace(strText, vbTab, " ")

        aWords = Split(strText, " ")
        uB = UBound(aWords)
        ReDim aClean(uB)
        i = 0
        For idxThis = 0 To uB
            s = Trim(GetAlpha(aWords(idxThis)))
            If s <> "" Then
                aClean(i) = s
                i = i + 1
            End If
        Next idxThis
        ' From here on uB is the number of cleaned words
        uB = i

        idxThis = 0
        Do While idxThis < uB - CHUNK_SIZE
            blnFound = False
            If idxThis Mod 200 = 0 Then Application.StatusBar = CStr(idxThis)
            DoEvents

            ' Build the phrase that starts at idxThis
            strSearch = ""
            For i = 0 To CHUNK_SIZE - 1
                strSearch = strSearch & aClean(idxThis + i) & " "
            Next i
            strSearch = Trim(strSearch)

            For idxNext = idxThis + CHUNK_SIZE To uB - CHUNK_SIZE
                ' Only build the candidate phrase when the first words agree
                If aClean(idxThis) = aClean(idxNext) Then
                    strMatch = ""
                    For i = 0 To CHUNK_SIZE - 1
                        strMatch = strMatch & aClean(idxNext + i) & " "
                    Next i
                    strMatch = Trim(strMatch)
                    If strSearch = strMatch Then
                        dInfo.Range.InsertAfter CStr(idxThis) & "," & strSearch & "," & CStr(idxNext) & vbNewLine
                        blnFound = True
                    End If
                End If
            Next idxNext

            ' After a match, skip past the matched phrase; otherwise move on one word
            If blnFound Then idxThis = idxThis + CHUNK_SIZE Else idxThis = idxThis + 1
        Loop

        Application.StatusBar = False
        Set dInfo = Nothing
        Set d = Nothing
    End Sub
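
An alternative to comparing every chunk with every later chunk is to record each chunk in a dictionary as it is seen, so the text is scanned only once. The sketch below is in Python rather than VBA and is not the macro above. It assumes the thesis has been saved as plain text, and the names `find_repeats` and `chunk_size` are made up for illustration. Like the macro, it keeps only letters and digits and is case-sensitive.

```python
import re
from collections import defaultdict


def find_repeats(text: str, chunk_size: int = 5) -> dict[str, list[int]]:
    """Map each chunk_size-word phrase that occurs more than once to its word positions."""
    # Split on any whitespace (spaces, tabs, line breaks), then strip punctuation
    words = [re.sub(r"[^A-Za-z0-9]", "", tok) for tok in text.split()]
    words = [w for w in words if w]

    # Record the starting word index of every window of chunk_size words
    positions = defaultdict(list)
    for i in range(len(words) - chunk_size + 1):
        phrase = " ".join(words[i:i + chunk_size])
        positions[phrase].append(i)

    # Keep only phrases seen at two or more positions
    return {p: idx for p, idx in positions.items() if len(idx) > 1}


# The repeated wording starts at a different point on each line
sample = (
    "The cat sat on the mat today. Later on,\n"
    "the cat sat on the mat again."
)
for phrase, where in find_repeats(sample, chunk_size=4).items():
    print(where, phrase)
```

Because the text is split into words before anything is compared, it makes no difference where on a line a repeat begins. A long duplicated passage will be reported once per overlapping window, so the output still needs reading by eye, and memory use grows with the number of distinct phrases.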
    
    te audire non possum. musa sapientum fixa est in aure.
This discussion has been closed.