Step Three: Cleanup – paragraph breaks, excess whitespace, hidden characters, tabs

The text we receive tends to need a lot of cleanup before we can start marking everything up. Copying from .indd files can introduce unexpected paragraph breaks, whitespace, and hidden characters, but so can Word docs. Word docs are especially prone to dropping references/inline citations from the body text when copied, for example. The many people who write the text also have different writing habits: some start new sentences with two spaces, some end paragraphs with whitespace, some leave stray whitespace or tabs on the blank lines between paragraphs, etc.

It is best to work in a program that lets you view whitespace and hidden characters and that uses syntax highlighting. Even better, find a program that supports code/markup clippings/snippets with key bindings/keyboard shortcuts and advanced find/replace functions such as GREP and regex (regular expressions). These tools will speed up your workflow, reduce tedium, reduce errors, and make things so much easier.

It is important to strip excess whitespace because the microsite web files act as our archive of the text in a readily available form. You’re going to help yourself and others out in the future by helping us keep our code and content clean and sustainable. Thanks in advance!

In this step, our goal is to clean up the following:

Paragraph breaks
If you didn’t already do so in the setup phase, go through each chapter and check that there are no bad paragraph breaks from the copy/paste process. Also make sure that all body paragraphs are separated from each other by a blank line.
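The paragraph checks above can be partly scripted. This is a minimal Python sketch, not part of our established workflow: the function names are made up for illustration, and it assumes paragraphs are already separated by at least one blank line. It normalizes the gaps between paragraphs to exactly one blank line and flags paragraphs that do not end in closing punctuation, which is often a sign of a bad break from copy/paste.

```python
import re

def normalize_paragraph_spacing(text):
    """Separate paragraphs with exactly one blank line."""
    # A run of blank lines (including lines holding only spaces/tabs) counts as one break.
    paragraphs = re.split(r"\n[ \t]*(?:\n[ \t]*)+", text.strip())
    return "\n\n".join(paragraphs) + "\n"

def suspect_breaks(text):
    """Return paragraphs that do not end in typical closing punctuation."""
    closers = (".", "!", "?", ":", ")", '"', "\u201d")
    paragraphs = normalize_paragraph_spacing(text).strip().split("\n\n")
    return [p for p in paragraphs if not p.rstrip().endswith(closers)]

if __name__ == "__main__":
    sample = "This sentence was split\n\n\n\nin the middle.\n  \nThis one is fine."
    print(normalize_paragraph_spacing(sample))
    print(suspect_breaks(sample))
```

Headings will show up in the suspect list too, since they rarely end in punctuation, so treat the output as a list of places to look at rather than a list of errors.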

Remove excess whitespace

Double spaces,  like this, can appear anywhere and everywhere. Missing inline citation numbers can produce them, but old-school writers/editors who start sentences with two spaces are also a contributing source.

White space at the end of paragraphs

White space on blank lines

Tabs: we don’t really need them and none of our markup uses them, so just delete any that pop up.
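The whitespace items above map onto a few simple find/replace passes. Here is a rough Python sketch of them; the function name is hypothetical. Two choices in it are assumptions rather than house rules: tabs are turned into a space before collapsing (so two words separated only by a tab do not fuse together), and leading spaces are trimmed on the assumption that our markup does not rely on indentation.

```python
import re

def strip_excess_whitespace(text):
    """Collapse double spaces, drop tabs, and trim line ends and blank lines."""
    cleaned = []
    for line in text.split("\n"):
        line = line.replace("\t", " ")       # tabs become spaces so adjacent words never fuse
        line = re.sub(r" {2,}", " ", line)   # collapse double (or longer) runs of spaces
        cleaned.append(line.strip(" "))      # trim both ends; whitespace-only lines become empty
    return "\n".join(cleaned)

if __name__ == "__main__":
    sample = "First sentence.  Second sentence.  \n \t \n\tTabbed line."
    print(repr(strip_excess_whitespace(sample)))
```

The same three patterns (a tab, two or more spaces, spaces at the end of a line) can be typed directly into the regex find/replace of most editors if you would rather not run a script.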

Remove hidden characters and other artifacts

Forced line breaks, thin spaces, en spaces, em spaces, non-breaking spaces, and those weird dots that I still to this day cannot explain   <- like that (they can only be seen with hidden characters turned on).

Check em dashes and ellipses are spaced out

This usually doesn’t come up anymore, but sometimes it does. Our em dashes need a space on both sides — like this one, and so do our ellipses … like that one.
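The hidden-character cleanup and the dash/ellipsis spacing check can be sketched in Python as below. The function names are hypothetical. The code points are the standard Unicode ones for each space type, but treating U+2028 (line separator) as the form a forced line break arrives in is an assumption, and replacing every special space with a plain space is a choice made for this sketch. The unexplained dots are not handled, since it is not clear what character they are.

```python
import re

# Special spaces mapped to a plain space. Which of these actually show up
# depends on the source file; U+2028 as "forced line break" is an assumption.
SPACE_LIKE = {
    "\u00a0": " ",  # non-breaking space
    "\u2002": " ",  # en space
    "\u2003": " ",  # em space
    "\u2009": " ",  # thin space
    "\u2028": " ",  # line separator
}

def replace_hidden_characters(text):
    """Swap the special spaces listed above for ordinary spaces."""
    return text.translate(str.maketrans(SPACE_LIKE))

def space_dashes_and_ellipses(text):
    """Put exactly one space between an em dash or ellipsis and adjacent text."""
    # One space before the mark when it follows a non-space character.
    text = re.sub(r"(?<=\S) *([\u2014\u2026])", r" \1", text)
    # One space after the mark when it is followed by a non-space character.
    return re.sub(r"([\u2014\u2026]) *(?=\S)", r"\1 ", text)

if __name__ == "__main__":
    raw = "dashes\u2014like this\u2014and ellipses\u2026like that"
    print(space_dashes_and_ellipses(replace_hidden_characters(raw)))
```

Marks at the very start or end of a line are left alone so no stray leading or trailing whitespace is added. An ellipsis that sits directly against a closing quote or a period will still get a space inserted, so those cases are worth a manual look.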

Remove page number references

Sometimes figure mentions and callout mentions include page numbers if you’re copying out of an .indd file. These need to be removed for the microsite.
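Page references are easier to find than to remove automatically, because the surrounding wording usually needs a small rewrite. The Python sketch below only lists candidates for manual cleanup. The pattern is an assumption about how the references are written ("p. 12", "pp. 12-14", "page 12"); adjust it to whatever form actually appears in the source files.

```python
import re

# Assumed forms: "p. 12", "pp. 12-14", "page 12", "pages 12–14" (hyphen or en dash).
PAGE_REF = re.compile(r"\b(?:pp?\.|pages?)\s*\d+(?:\s*[-\u2013]\s*\d+)?", re.IGNORECASE)

def find_page_references(text):
    """Return (line number, matched text) pairs for likely page references."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        for match in PAGE_REF.finditer(line):
            hits.append((number, match.group(0)))
    return hits

if __name__ == "__main__":
    sample = "See figure 2.1 on page 14.\nNo reference here.\n(callout 3, p. 22)"
    for line_number, found in find_page_references(sample):
        print(line_number, found)
```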