Apache CouchDB to store text

So far the application finds parts of speech and named entities. To do anything more than this, we need a way of working with a larger piece of text, such as a novel. I have chosen Apache CouchDB for this purpose. It's a simple JSON-based document database.

CouchDB has handy bulk insert operations, so all 5419 paragraphs of the book can be uploaded in about a second.

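# Build one document per paragraph. Note that "id" is an ordinary field here;
# CouchDB generates its own _id for each document.
docs = []
for para_id, content in enumerate(cleaned_paragraphs):
    docs.append({"id": para_id, "content": content})

# Upload every document in a single bulk request.
db.bulk_docs(docs)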
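If you are not using a client library, the same upload can be sketched against CouchDB's HTTP API, where a bulk insert is a POST of a {"docs": [...]} body to the database's _bulk_docs endpoint. This is only a rough sketch under some assumptions: the helper names, the "para-00000" ID scheme, and the unauthenticated server are all hypothetical and not part of the application above.

```python
import json
import urllib.request


def build_bulk_payload(paragraphs):
    """Wrap paragraphs in the {"docs": [...]} body that _bulk_docs expects."""
    docs = [
        # Assumption: explicit string _id values, zero-padded so that they
        # sort in paragraph order. CouchDB requires _id to be a string.
        {"_id": f"para-{n:05d}", "content": text}
        for n, text in enumerate(paragraphs)
    ]
    return {"docs": docs}


def post_bulk(base_url, db_name, payload):
    """POST the payload to /{db}/_bulk_docs (assumes no authentication)."""
    req = urllib.request.Request(
        f"{base_url}/{db_name}/_bulk_docs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Building the payload needs no server; post_bulk() would send it.
payload = build_bulk_payload(["First paragraph.", "Second paragraph."])
```

Choosing the _id yourself means a paragraph can later be fetched directly by its number, at the cost of having to keep the IDs unique.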


Next up is applying our POS tagger to the text, either paragraph by paragraph or globally for the whole novel. The fact that CouchDB is schemaless is a huge advantage here. We can update some paragraphs to carry POS tags, for example, and leave the others as they are. Our Python code can then decide how to deal with whichever structure comes back.
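As a minimal sketch of what that could look like, the snippet below uses plain dicts to stand in for documents fetched from CouchDB. The "pos" field name and the toy_tagger function are assumptions made for illustration; the real POS tagger from earlier in the application would take its place.

```python
def toy_tagger(text):
    """Hypothetical stand-in for the real POS tagger: tags every token 'X'."""
    return [(token, "X") for token in text.split()]


def add_pos_tags(doc, tagger):
    """Return a copy of the document with a 'pos' field added."""
    tagged = dict(doc)
    # JSON has no tuples, so store each (token, tag) pair as a list.
    tagged["pos"] = [list(pair) for pair in tagger(doc["content"])]
    return tagged


def describe(doc):
    """Handle both document shapes: with and without POS tags."""
    if "pos" in doc:
        return "tagged"
    return "untagged"


# Plain dicts standing in for paragraphs read back from the database.
docs = [
    {"id": 0, "content": "The cat sat."},
    {"id": 1, "content": "It was late."},
]

# Tag only the first paragraph; the second keeps its original structure.
docs[0] = add_pos_tags(docs[0], toy_tagger)

summary = [(d["id"], describe(d)) for d in docs]
```

To persist a change like this in CouchDB, the updated document has to be saved back with its current _rev, which a document read from the database already carries.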
