21 February 2020

Text transformations.

I converted a PDF to plain text so that I could feed it to a text-to-speech program and generate an audiobook.

The first conversion (using the tool from the link above) already removed the page numbers, footers and headings (other converters did not), but I still had to do some extra cleaning.

Below are pairs of terms, in the form
SEARCH
REPLACE
that can be used in the Find and Replace panel of the Sublime Text editor (with the regular expression option enabled) to easily clean the whole text.

Note: Whenever you see " (double quotes) below, remove them before pasting the term into Sublime. I left them in only to make the spaces visible where they are not obvious.

# remove - at the end of a line and join lines
-\n
""

# join lines
(\w)\n(\w)
\1 \2


# join lines when the first one ends with a comma
(,)\n(\w)
\1 \2

# join lines separated by an empty line. Note that here I am looking for a lowercase letter at the start of the second line
(\w)\n\n([a-z])
\1 \2


# same as above, but with two empty lines in between
(\w)\n\n\n([a-z])
\1 \2


# remove the extra spaces after a comma at the end of a line and join lines
", {2,9}\n"
", "


# remove the extra spaces at the end of a line and join lines
" {2,9}\n"
" "
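
If you would rather not paste each pair by hand, the same list can be applied from a script. Below is a minimal sketch in Python, not what I used for the original cleanup. It assumes that Python's re module treats these particular patterns the same way Sublime does, and that the text uses \n line endings. The names RULES and clean_text are made up for the example.

```python
import re

# (SEARCH, REPLACE) pairs, in the same order as the list above.
# The double quotes from the post are already removed here.
RULES = [
    (r"-\n", ""),                       # remove - at the end of a line and join lines
    (r"(\w)\n(\w)", r"\1 \2"),          # join lines
    (r"(,)\n(\w)", r"\1 \2"),           # join lines after a comma
    (r"(\w)\n\n([a-z])", r"\1 \2"),     # join across one empty line (lowercase only)
    (r"(\w)\n\n\n([a-z])", r"\1 \2"),   # join across two empty lines (lowercase only)
    (r", {2,9}\n", ", "),               # comma followed by 2 to 9 spaces at line end
    (r" {2,9}\n", " "),                 # 2 to 9 spaces at line end
]


def clean_text(text):
    """Apply every search/replace pair once, in order, to the whole text."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    sample = "The conver-\nsion was easy,\nbut the text\n\nneeded cleaning.  \nDone."
    print(clean_text(sample))
```

The order matters: the hyphen rule has to run before the plain join, otherwise a word split across two lines would keep its hyphen. Keep in mind that the first rule also glues together real hyphenated words that happen to break at the end of a line, so it is worth skimming the result.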
