+1 vote

Removing words from hyphenatedWords.txt that no longer exist in the project

I am trying to prepare the hyphenatedWords.txt file for native speakers to work on. I want to remove all words that are not actually in the project. However, there are many words in the project's Wordlist data that were at one time marked as Correct or Incorrect but no longer exist in the project. Even though they are now marked with spelling status Unknown, they continue to appear in hyphenatedWords.txt if they have approved hyphenation.
I tried exporting the word list to XML, changing hyphenationApproved to false, and importing it again; however, this did not change their hyphenation status. Example:
I changed the hyphenationApproved attribute from true to false for the following entry
<item word="abaaramaadha" count="0" hyphenation="abaara=maa=dha" hyphenationApproved="False" morphology="abaar +amaa +dha" morphologyApproved="False" spelling="Incorrect" correction="abaaramaa dha" />
I then imported this back into Wordlist. Nevertheless, after I opened up Wordlist, the hyphenation was still marked as approved. It appears the only way I can successfully unapprove hyphenation is to individually tick every one of the approved hyphenations in the Word list tool.

Removing entries that no longer exist in the project which are clearly wrong

A second concern is that I cannot remove from the Wordlist database any of the thousands of words that were entered into Wordlist and reviewed over the 10+ year lifetime of the project. For example, the straight apostrophe was changed to a character modifier apostrophe. We would like to remove those 139 entries. There are even 233 “words” with spaces in them in the database, like this:
<item word="a si" count="0" hyphenation="a si" hyphenationApproved="False" spelling="Incorrect" />
It appears that Wordlist keeps a memory of any word that made it into the project and was reviewed. But there is a need to remove words that are no longer legal words in order to make it easier to work with the tool. With over 43,000 words in the project I would love to clean up the Wordlist database and hopefully make it run faster.
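
For anyone who wants to at least list these entries from the exported XML, here is a minimal Python sketch. It only reports the suspect words; it does not remove anything from the Wordlist database. The `<wordlist>` root element in the sample and the function name `find_suspect_words` are my own assumptions for illustration; only the `<item>` attributes are taken from the export shown above.

```python
import xml.etree.ElementTree as ET


def find_suspect_words(xml_text):
    """Return words that contain a space or a straight apostrophe."""
    root = ET.fromstring(xml_text)
    suspects = []
    # iter("item") finds <item> elements at any depth, so the name of the
    # root element (assumed here) does not matter.
    for item in root.iter("item"):
        word = item.get("word", "")
        if " " in word or "'" in word:
            suspects.append(word)
    return suspects


# Sample data: the two <item> lines are copied from the post above;
# the <wordlist> wrapper is an assumption.
sample = """<wordlist>
  <item word="a si" count="0" hyphenation="a si" hyphenationApproved="False" spelling="Incorrect" />
  <item word="abaaramaadha" count="0" hyphenation="abaara=maa=dha" hyphenationApproved="False" spelling="Incorrect" />
</wordlist>"""

print(find_suspect_words(sample))
```

To run it on a real export, read the file with `open(path, encoding="utf-8").read()` and pass the text to `find_suspect_words`.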

Paratext by (1.8k points)

4 Answers

+1 vote
Best answer

As I mentioned above, the Wordlist tool exports all words with a hyphenation status to hyphenatedWords.txt. This may include words that are no longer in the project and words in books you may not be interested in checking.

I have developed an Excel spreadsheet to get a list of words in hyphenatedWords.txt format that only includes the words in a portion of Scripture you are interested in. It includes a count and percentage of words that have approved hyphenation, and a count of words with unapproved hyphenation. You can copy the list to the project’s hyphenatedWords.txt file and then edit it to update the hyphenation status of words in that portion.

Before you start

  1. Get the Excel file hyphenatedWords.xlst here: https://biblica.box.com/v/hyphenatedWordsTool
  2. Put it somewhere on your computer where you can find it easily.
    Suggestion: put the file on your Desktop or in the My Paratext 8 Projects folder

Prerequisite: You must have an Excel version that supports Power Query

Instructions

  1. In Paratext, open the Wordlist tool for the particular project you are interested in.
  2. Make sure the hyphenation column is visible. (From the Wordlist menu, under View, click Show Hyphenation.)
  3. Make sure All words is chosen in the Words filter.
  4. Use the Verses filter to choose the portion you are interested in.
  5. From the Wordlist menu under Wordlist click Export to XML…
  6. Create or overwrite the file C:\My Paratext 8 Projects\hyphenatedWords.xml
    Important: The file hyphenatedWords.xml in the C:\My Paratext 8 Projects folder is a temporary file used by the hyphenatedWords.xlst Excel spreadsheet. If you do not have a C:\My Paratext 8 Projects folder on your computer, you will need to create it; or you can modify hyphenatedWords.xlst to look for the file in a folder of your choice, where you plan to export the Wordlist data you want to transform.
  7. Open hyphenatedWords.xlst
  8. Click the Data tab
  9. Click Queries & Connections (optional)
  10. Click Refresh All
    Result: The hyphenatedWords sheet will display a list of words in hyphenatedWords.txt format that only includes the words in the portion of the project you exported to XML. It will include a line in the header with the number of approved words, their percentage, the number of unapproved words, and the total number of words exported.
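
If you do not have an Excel version with Power Query, the same kind of transformation can be sketched in Python. This is not what the spreadsheet does internally; it is a rough equivalent under stated assumptions: the `<wordlist>` root element, the layout of the summary line, and the sample words `kitaab` and `salaam` are invented for illustration, and the leading `*` for approved entries follows the hyphenatedWords.txt convention mentioned elsewhere in this thread.

```python
import xml.etree.ElementTree as ET


def to_hyphenated_lines(xml_text):
    """Return (summary, lines) built from a Wordlist XML export."""
    root = ET.fromstring(xml_text)
    approved, unapproved = [], []
    for item in root.iter("item"):
        hyph = item.get("hyphenation")
        if not hyph:
            continue  # skip entries without any hyphenation
        if item.get("hyphenationApproved", "").lower() == "true":
            approved.append("*" + hyph)  # "*" marks approved hyphenation
        else:
            unapproved.append(hyph)
    total = len(approved) + len(unapproved)
    pct = 100.0 * len(approved) / total if total else 0.0
    # Hypothetical summary line, not the real hyphenatedWords.txt header.
    summary = (f"# approved: {len(approved)} ({pct:.1f}%), "
               f"unapproved: {len(unapproved)}, total: {total}")
    # Sort by the word itself, ignoring the approval marker.
    lines = sorted(approved + unapproved, key=lambda s: s.lstrip("*"))
    return summary, lines


# Invented sample data for illustration only.
sample = """<wordlist>
  <item word="salaam" count="3" hyphenation="sa=laam" hyphenationApproved="True" />
  <item word="kitaab" count="2" hyphenation="ki=taab" hyphenationApproved="False" />
  <item word="abaaramaadha" count="1" hyphenation="abaara=maa=dha" hyphenationApproved="True" />
</wordlist>"""

summary, lines = to_hyphenated_lines(sample)
print(summary)
print("\n".join(lines))
```

Check the output against your existing hyphenatedWords.txt before copying anything across, since the real file has its own header lines that this sketch does not reproduce.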

Please let me know if you find this helpful or have any suggestions for improvements.

by (1.8k points)
0 votes

If you’re saying you get too many items showing in the Wordlist (I think that’s what you’re saying), make sure that View > Show reviewed words which no longer exist in project is not checked.

by [Expert]
(16.2k points)

No, actually what I am saying is I am seeing too many words in hyphenatedWords.txt.
I am seeing hyphenations for words that are no longer in the project and have not been approved as correct. However, their hyphenations have been approved, which appears to be why these words are included in hyphenatedWords.txt.
In addition, there is no effective way of removing the hyphenation approval from words except by unticking them one by one in the Wordlist interface. This has been an issue for many years.

0 votes

I’m confused. Why do you care what’s in the data files if you aren’t seeing them in the Wordlist?

by [Expert]
(16.2k points)

Because the tools to manipulate the hyphenated words in the Wordlist tool are next to non-existent. You cannot filter on patterns in the hyphenated words and you cannot use regular expressions to make changes to the list. This is why we do a lot of work on hyphenated words outside of the tool, directly in hyphenatedWords.txt.
The problem is that there are words in the hyphenatedWords.txt file that do not appear in the Wordlist. Many of these words are misspellings. I want to filter them out before working on the hyphenatedWords.txt file.

Back in Paratext 7 days I discovered that you could edit the hyphenatedWords.txt file directly. To get Paratext to load the changes you have to quit the programme and restart it. The hyphenation data is cached somewhere. It is not enough to reload your project. That was Paratext 7; I don’t know if it holds true in Paratext 9.

It may be helpful to remove the “hyphenatedWords.txt.BAK” file from your project’s folder. Then, when you edit the “hyphenatedWords.txt” file (outside of PT), close PT, make sure the updated txt file is there and the BAK file is not, and then restart the program. Just an idea, if you haven’t tried it already.

0 votes

CrazyRocky - to remove approval from the hyphenated words, you must open hyphenatedWords.txt and remove the * from the beginning of each approved word.

by (8.4k points)

What you say is true; however, the words that do not exist in the project are mixed with the words that do exist, and the difference is not marked in hyphenatedWords.txt. I only want to remove the * from the words that no longer exist in the project. Technically, I only want to remove it from the words that no longer exist in the project and have Unknown or Incorrect spelling status.
I thought I could do this by exporting the Wordlist as XML, manipulating the XML and re-importing it; however, the hyphenation status did not change when I did that.
I could accomplish what I need, I suppose, by exporting the Wordlist as XML, manipulating the XML into a new hyphenatedWords.txt and importing that.
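
A rough Python sketch of that idea, in case it helps someone else. It rests on several assumptions that I have not confirmed against Paratext itself: that `count="0"` in the export means the word no longer occurs in the project, that a missing `spelling` attribute means Unknown, that each entry in hyphenatedWords.txt is one line with an optional leading `*`, and that removing the `=` marks from an entry gives back the word. The function names and the sample word `kitaab` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def words_to_unapprove(xml_text):
    """Collect words that are absent from the project and not spelled Correct."""
    root = ET.fromstring(xml_text)
    targets = set()
    for item in root.iter("item"):
        absent = item.get("count") == "0"  # assumption: 0 = not in project
        spelling = item.get("spelling", "Unknown")  # assumption: missing = Unknown
        if absent and spelling in ("Unknown", "Incorrect"):
            targets.add(item.get("word", ""))
    return targets


def unapprove(lines, targets):
    """Strip the leading * from approved entries whose word is in targets."""
    result = []
    for line in lines:
        if line.startswith("*"):
            word = line[1:].replace("=", "")  # assumption: "=" marks breaks
            if word in targets:
                line = line[1:]
        result.append(line)
    return result


# Sample data: the first <item> is from the question; the rest is invented.
sample_xml = """<wordlist>
  <item word="abaaramaadha" count="0" hyphenation="abaara=maa=dha" hyphenationApproved="True" spelling="Incorrect" />
  <item word="kitaab" count="5" hyphenation="ki=taab" hyphenationApproved="True" spelling="Correct" />
</wordlist>"""
sample_lines = ["*abaara=maa=dha", "*ki=taab"]

print(unapprove(sample_lines, words_to_unapprove(sample_xml)))
```

Header and comment lines in the real file do not start with `*`, so they pass through unchanged. Work on a copy of hyphenatedWords.txt and restart Paratext afterwards, as suggested above.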
