Extracting Text from Journal Articles: Postoperative Imaging of Sarcomas


I’m mixing it up a little bit with this post.  Yesterday I was able to extract the text using PyPDF2, but it came out jumbled.  It cleaned up okay using the .replace() function, but the text was still out of order.

After looking at GhostScript, I figured I might give my go-to PAID PDF app a shot, and it didn’t fail me.  I used NitroPro to export the text, and it was largely unperturbed from the original format.


There are still a few formatting issues, which yesterday’s Python can easily clean up.  No need for imported modules.

You may ask why go through the trouble of extracting the text?  I often summarize my learning points on this site so I can easily reference them and include tables to add to my reference page.  If you copy from the web page, you end up with this text appended.

Read More: https://www.ajronline.org/doi/full/10.2214/AJR.18.19954?action=autoLogin&sso_token=39F…0D2334

Essentially, you would need to delete the JavaScript that adds that link and reload the page.  This isn’t really practical just to read one article.
So with a little bit of Python, it goes to:
Still not perfect but pretty darn good for copypasta into a summary.
Script Below:

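with open('C:\\Users\\g\\Downloads\\ajr.txt', 'r', encoding='utf-8') as file_obj:
    in_text = file_obj.read()

# str.replace() returns a new string, so the result has to be assigned back
in_text = in_text.replace('\n', ' ')
in_text = in_text.replace('- ', '')  # rejoin words hyphenated across line breaks

with open('C:\\Users\\g\\Downloads\\ajr_clean.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(in_text)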
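A slightly more careful version of the same cleanup could use the re module. This is only a sketch: the function name clean_extracted_text is made up for illustration, and it assumes the appended link always sits on its own line starting with "Read More:".

```python
import re


def clean_extracted_text(text: str) -> str:
    """Tidy text exported from a PDF or copied from a journal web page."""
    # Drop any "Read More: <url>" line the site appends on copy (assumed format)
    text = re.sub(r'^Read More:.*$', '', text, flags=re.MULTILINE)
    # Rejoin words split across a line break: "postoperative-\nly" -> "postoperatively"
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    # Collapse the remaining newlines and runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()
```

One caveat: a genuine compound that happens to break at a hyphen (end-to- end, for example) would get merged too, so the output still deserves a quick read-through.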

  • Multiple studies in the literature support a correlation between local recurrence of soft tissue sarcoma and high-risk factors such as intermediate- or high-grade tumor, tumor larger than 5 cm, deep location, multifocally positive surgical margins, and absence of wide resection [1, 12–14]. Mortality from soft tissue sarcoma has been associated with local recurrence, tumor larger than 10 cm, deep location, high grade, and positive surgical margins.
  • While CT scan is often preferred due to its greater sensitivity in detecting small lung nodules, it is unknown whether this provides benefit over CXR alone. Both modalities are considered highly appropriate for this purpose by the American College of Radiology (ACR) [74].
    • This is questionably accurate.  CT is recommended by the ACR.  While CXR is appropriate and would be considered reasonable, CT is preferred.
    • Although chest radiographs were historically used [6], unenhanced chest CT has become the recommended modality. Surveillance intervals vary from 3 to 6 months in the first several years to annually up to 10 years (Tables 1 and 2).
  • A retrospective review performed in the United Kingdom found that CXR alone detected two-thirds of pulmonary metastases in patients with soft tissue sarcoma; when compared with CT as the “gold standard,” the sensitivity, specificity, positive predictive value, and negative predictive value of CXR were 60.8, 99.6, 93.3, and 96.7 percent, respectively [29]. The use of CXR only to stage the lungs would have missed one-third of all patients with lung metastases, but because of the infrequency of lung metastases overall (96 of 1170 patients), the initial staging would have been inaccurate in only 3.1 percent of cases.
    • Commentary:  There is a CME question on this one and I believe the question is either vague or their answer is just wrong.  First, it isn’t in the article.  A review of primary literature sources showed a rate far lower than 1/3 and the summary above
  • Radiation-induced sarcomas have different histologic composition than the patients’ original treated tumors; within the field of treatment, therefore, MRI characteristics widely differ [14]. High-grade undifferentiated pleomorphic sarcoma is the most common postradiation sarcoma of the soft tissues, representing two-thirds of radiation-induced sarcoma. Extraskeletal osteosarcoma and fibrosarcoma follow, representing 13% and 11% of cases [14]. Conversely, osteosarcoma is by far the most common radiation-induced malignancy affecting bone, accounting for approximately 60% of cases [25]. Undifferentiated pleomorphic sarcoma is a distant second, accounting for approximately 20% of cases [14].


Postoperative Imaging of the Ankle

Achilles Tendon Repair

  • In acute injury with functionally limiting partial-thickness tear or complete tear with less than 3 cm tendon gap, a direct end-to-end anastomotic repair is preferable [9].
    • In this repair the proximal and distal tendon stumps are mobilized and directly anastomosed.
  • MRI or ultrasound imaging performed within the first 2 months after surgery may show a residual tendon gap at the site of anastomosis related to postsurgical granulation. This gap should fill in with T2-intermediate fibrous material by 14 weeks [12]. Heterogeneous intersubstance T2 signal intensity persists as long as 12 months postoperatively, after which the tendon assumes a rounded morphologic appearance as much as 4–6 times the diameter of the contralateral unaffected tendon [13, 14] (Fig. 1).

Code to extract text from the PDF (WordPress tends to screw up the formatting, as usual):

import PyPDF2

# Uses the legacy PyPDF2 (pre-3.0) names: PdfFileReader, numPages, getPage, extractText
with open('article.pdf', 'rb') as file_obj, open('outfile.txt', 'wb') as outfile:
    pdfReader = PyPDF2.PdfFileReader(file_obj)  # creating a pdf reader object
    pagecount = pdfReader.numPages
    for pagenbr in range(pagecount):  # range(0, pagecount - 1) skipped the last page
        pagetxt = pdfReader.getPage(pagenbr).extractText().replace('\n', ' ')
        pagetxt = pagetxt.replace('-', '').encode("utf-8")
        outfile.write(pagetxt)  # the original never wrote anything to the file
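PdfFileReader, numPages, getPage and extractText are the older PyPDF2 names; as far as I know they were removed in PyPDF2 3.0, and the project now continues as pypdf. A rough equivalent with pypdf might look like the sketch below. The function names clean_page_text and extract_pdf_text are my own for illustration, and it assumes pypdf is installed.

```python
def clean_page_text(page_text: str) -> str:
    """Flatten one page of extracted text onto a single line."""
    return ' '.join(page_text.split())


def extract_pdf_text(pdf_path: str, out_path: str) -> int:
    """Write the cleaned text of every page to out_path; return the page count."""
    # Imported here so clean_page_text still works without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    with open(out_path, 'w', encoding='utf-8') as outfile:
        for page in reader.pages:  # iterates every page, including the last
            # Fall back to '' in case a page yields no text (e.g. image-only pages)
            outfile.write(clean_page_text(page.extract_text() or '') + '\n')
    return len(reader.pages)
```

Something like extract_pdf_text('article.pdf', 'outfile.txt') would then write one line of text per page. It won't fix the out-of-order text problem, which comes from the PDF layout itself.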


Backing up your RPi or SBC SD card

I’ve been going through iterations of SBC OS building and really don’t want to start from scratch, especially on my most basic boards.

The alternative is to use Win32 Disk Imager to create an .img file, but those are huge, the same size as the disk.  It reads all sectors, whether they have data or not.  Writing them back takes about as long as reinstalling an OS.

This is a handy command for creating a compressed image to stick on the SD card somewhere.


SSH into the SBC

Then run

fdisk -l

to get a listing of your drives.


dd bs=4M if=/dev/"YOUR SD MOUNT HERE" | gzip > OrangePi-`date +%d%m%y`.img.gz

NB: this writes it to the SD card on the SBC, so you’d better have room for it.  You can then move it off for safekeeping.   Alternatively, you could back it up to a USB stick.
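If you would rather script it, the same dd-into-gzip idea can be sketched in Python with just the standard library. The function name backup_device and its parameters are made up for this example, and the device path you pass in is whatever fdisk -l showed you. Reading a raw device normally needs root.

```python
import gzip
import os
import shutil
from datetime import date


def backup_device(device_path: str, out_dir: str = '.', prefix: str = 'OrangePi') -> str:
    """Stream device_path through gzip into a dated .img.gz file; return its path."""
    # Same naming scheme as the shell one-liner: prefix-DDMMYY.img.gz
    out_path = os.path.join(out_dir, f"{prefix}-{date.today():%d%m%y}.img.gz")
    with open(device_path, 'rb') as src, gzip.open(out_path, 'wb') as dst:
        # Copy in 4 MiB chunks, mirroring bs=4M in the dd command
        shutil.copyfileobj(src, dst, length=4 * 1024 * 1024)
    return out_path
```

Pointing out_dir at a mounted USB stick would cover the "back it up to a USB stick" case. Like the dd version, it still reads every sector, but the empty ones compress down to almost nothing.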






Weird Mesenteric AVM

This patient has at least 3 things going on. The most obvious is a tangle of vessels in the small bowel mesentery, presumably representing an AVM of some type, although I cannot recall having seen the mesenteric arterial portal shunt type AVM before. This makes me wonder about a diagnosis of HHT. Additionally, the patient has a very redundant sigmoid colon with acute diverticulitis in the right lower quadrant near the appendix. Further, the patient has stenosis of the celiac axis with presumed poststenotic dilation. The dilation could also be in the spectrum of vascular disease associated with the AVM.

Stab Wound to the Forearm with Ulnar Artery Pseudoaneurysm

This is a very interesting case of a stab wound to the forearm which lacerated, if not transected, the ulnar artery, producing a large bilobed pseudoaneurysm. There is also edema in the distal flexor compartment compatible with myonecrosis associated with ischemia/infarction following the stab wound.



Unfortunately, this exemplifies an all-too-common event: technologists not reconstructing according to the patient’s anatomy, which ends up giving a crappy view of some uncommon and interesting pathology.