Help searching a document
slopemaster
Posts: 1,584 Forumite
in Techie Stuff
I hope some of you clever people can help me with my thesis.
It's a long document - will be about 80 000 words and, as you can imagine, has been through a lot of changes and restructurings along the way.
So I am now afraid that I may have ended up using exactly the same sentence/paragraph more than once in different places, and I need a way to check for that. Of course I don't know what the sentence would be, so a normal edit/find will not work.
So, what I need is a way to search for any string of, say 5 or more words, which is repeated in the document.
Is there a way to do that???
Would be VERY grateful for any help, as this would be a massive task to try to check manually...
Comments
-
Might be possible but please tell me you intend to proof read it yourself!
Computers aren't anywhere near as good as the Mk1 brain at this kind of thing.
As part of the proof reading process you should be able to spot anything similar or repetitive, as well as gauging the overall structure of the thesis.
One by one the penguins are slowly stealing my sanity.
-
Oh yes, of course!
I have already started proof-reading, and it was the vague feeling of 'haven't I said this somewhere else?' that sparked the Q.
The problem is tho', that it can be hard to know whether I moved it from another section, or duplicated it in another section...
I also have a generous offer from another human (my OH) to proof-read, and he is good at spotting small mistakes which I miss becos I read what "should" be there...
But, it seemed to me that this is something a computer would actually do better than a human...
-
Hmmm...
I agree, I think it could be better to proof it yourself. There's no easy way, though there are some programs that I can't vouch for, such as this
It's quite unlikely that you will have written the exact same sentence twice unless, as you say, you copied and pasted. But nine times out of ten you would likely have re-written it slightly.
It will be frustrating but I think the best thing would be to work your way through proof reading, and every time you read something that feels a little "too" familiar, write it down separately and then forget about it. Then check over the compiled list of "déjà vu" later and see if there are any matches.
Alternatively if something feels familiar, do the Ctrl + F on one of the key words in the sentence and see where else it appears.
It will be time consuming but well worth it in the end! Congrats on writing it!
DEBT FREE AT LAST!
Virtual Sealed Pot Challenge 2014 - Member 161
Single Pot 1 Total:£23.32
Joint Account Pot Total:£6.67
-
Thanks, I'll have a look at that program.
-
Well, it seems the problem is much more complex than I realised.
That program only finds repeated LINES, so that won't work for me. (As the repeated phrase or sentence might start at a different point on the line)
A pretty exhaustive search has found lots of people asking the same Q as me, but no real answers!
Still, if anyone does know, I'll be eternally grateful.
-
You don't say what you're working in, but if it's Word, it should be fairly easy to accomplish - but computationally pretty darn expensive! I've thrown together a very basic bit of VBA to tokenise the text and search for matching chunks of a certain length. (It's not been checked
The larger the document, and the smaller the CHUNK_SIZE below, the slower it will be! The current position is displayed on the application status bar. Alice in Wonderland (30K words) takes about a minute on my high-spec laptop and finds nine repeats on a chunk size of 10.
If you're not familiar with VBA, take a look at http://msdn.microsoft.com/en-us/library/office/ee814737(v=office.14).aspx
- TAKE A BACKUP OF YOUR DOCUMENT
- Open the VBA Editor
- Insert a new module
- Paste the code below
- Change CHUNK_SIZE to an appropriate number
- Run FindMatches
- It'll produce a new document listing the word number, phrase, and the word where the match starts - your original should be untouched.
- MAKE SURE YOU TAKE A BACKUP FIRST THOUGH!
Function GetAlpha(s As String) As String
    'Return s with everything except letters and digits stripped out
    Dim sOut As String
    Dim i As Long
    sOut = ""
    For i = 1 To Len(s)
        If Mid(s, i, 1) Like "[A-Za-z0-9]" Then sOut = sOut + Mid(s, i, 1)
    Next i
    GetAlpha = sOut
End Function

Public Sub FindMatches()
    'words in repeated phrase
    Const CHUNK_SIZE As Integer = 5
    Dim aWords() As String
    Dim aClean() As String
    Dim strSearch As String
    Dim strMatch As String
    Dim idxThis As Long
    Dim idxNext As Long
    Dim s As String
    Dim uB As Long
    Dim i As Long
    Dim d As Document
    Dim dInfo As Document
    Dim blnFound As Boolean

    'Search the document that is active when the macro is run;
    'the results go into a new, separate document
    Set d = ActiveDocument
    Set dInfo = Application.Documents.Add

    'Treat paragraph marks, line breaks and tabs as word separators too
    s = d.Range.Text
    s = Replace(s, vbCr, " ")
    s = Replace(s, Chr(11), " ")
    s = Replace(s, vbTab, " ")
    aWords = Split(s, " ")
    uB = UBound(aWords) - 1
    ReDim aClean(uB + 1)

    'Build a list of cleaned, non-empty words
    i = 0
    For idxThis = 0 To uB
        s = Trim(GetAlpha(aWords(idxThis)))
        If s <> "" Then
            aClean(i) = s
            i = i + 1
        End If
    Next idxThis
    uB = i

    idxThis = 0
    Do While idxThis < uB - CHUNK_SIZE
        blnFound = False
        If idxThis Mod 200 = 0 Then Application.StatusBar = CStr(idxThis)
        DoEvents

        'The chunk to look for, starting at the current word
        strSearch = ""
        For i = 0 To CHUNK_SIZE - 1
            strSearch = strSearch + aClean(idxThis + i) + " "
        Next i
        strSearch = Trim(strSearch)

        'Compare it with every later chunk that starts with the same word
        For idxNext = idxThis + CHUNK_SIZE To uB - CHUNK_SIZE
            If aClean(idxThis) = aClean(idxNext) Then
                strMatch = ""
                For i = 0 To CHUNK_SIZE - 1
                    strMatch = strMatch + aClean(idxNext + i) + " "
                Next i
                strMatch = Trim(strMatch)
                If strSearch = strMatch Then
                    dInfo.Range.InsertAfter CStr(idxThis) + "," + strSearch + "," + CStr(idxNext) + vbNewLine
                    blnFound = True
                End If
            End If
        Next idxNext

        'Skip past a matched chunk, otherwise move on one word
        If blnFound Then idxThis = idxThis + CHUNK_SIZE Else idxThis = idxThis + 1
    Loop

    Application.StatusBar = False
    Set dInfo = Nothing
    Set d = Nothing
End Sub

te audire non possum. musa sapientum fixa est in aure.
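For anyone not working in Word, the same idea can be sketched in Python. This is not the macro above translated line for line. It is a different approach that puts every run of CHUNK_SIZE words into a dictionary and reports the runs that turn up more than once, which avoids comparing every chunk against every later chunk. It assumes the thesis has been saved as plain text, and the names `tokenize` and `find_repeats` are made up for this illustration. Unlike the VBA, it ignores upper and lower case and keeps apostrophes inside words.

```python
import re
from collections import defaultdict

CHUNK_SIZE = 5  # words per phrase, same meaning as CHUNK_SIZE in the VBA


def tokenize(text):
    """Lower-case the text and return its words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_repeats(text, chunk_size=CHUNK_SIZE):
    """Return {phrase: [start positions]} for every phrase of chunk_size
    words that occurs more than once. Positions are 0-based word numbers."""
    words = tokenize(text)
    positions = defaultdict(list)
    # Slide a window of chunk_size words along the text, one word at a time,
    # so a repeat is found wherever it starts on the line.
    for start in range(len(words) - chunk_size + 1):
        phrase = " ".join(words[start:start + chunk_size])
        positions[phrase].append(start)
    return {p: starts for p, starts in positions.items() if len(starts) > 1}


if __name__ == "__main__":
    sample = (
        "The results show a clear trend over time. In chapter two we "
        "argue otherwise.\nHowever, the results show a clear trend "
        "over time, as noted above."
    )
    for phrase, starts in find_repeats(sample).items():
        print(starts, phrase)
```

Because the window moves one word at a time, a repeated sentence longer than CHUNK_SIZE shows up as several overlapping hits in a row. To run it on a real file, read the text first (for example with `open("thesis.txt", encoding="utf-8").read()`, where the file name is only a placeholder) and pass it to `find_repeats`.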
This discussion has been closed.