Converting PDF and DOCX files to text
For a data analysis project on CVs, I needed to batch-convert PDF and DOCX (Microsoft Word) files to plain-text files (TXT). The aim was to apply machine learning algorithms to the extracted text.
The Apache Tika library made this conversion easy: hundreds of documents were converted in only a few seconds. On macOS, it can be installed with Homebrew:
brew install tika
To get the list of available commands:
tika --help
All the documents that I wanted to convert were placed in a folder named input. With the following command, all .docx documents were converted to .txt documents in the folder output:
tika --text -i ~/Desktop/input/ -o ~/Desktop/output/
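After a batch run like the one above, it can be useful to check that every input document produced an output file. The following is a minimal sketch, not part of the original workflow: the function names count_files and check_conversion are hypothetical helpers, and the check assumes that Tika writes exactly one text file per input document and that hidden files (such as .DS_Store on macOS) should be ignored.

```shell
#!/usr/bin/env bash

# Count regular, non-hidden files under a directory (recursively).
# Hypothetical helper, not a Tika command.
count_files() {
  find "$1" -type f ! -name '.*' | wc -l | tr -d ' '
}

# Compare the number of files in the input and output folders.
# Assumption: one output text file per input document.
# Returns 0 when the counts match, non-zero otherwise.
check_conversion() {
  local in_dir="$1" out_dir="$2"
  local n_in n_out
  n_in=$(count_files "$in_dir")
  n_out=$(count_files "$out_dir")
  echo "input: $n_in files, output: $n_out files"
  [ "$n_in" -eq "$n_out" ]
}
```

With the folders used above, the check could be run as check_conversion ~/Desktop/input ~/Desktop/output. A mismatch only signals that some documents may have been skipped; it does not say which ones or why.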