tero.co.uk

Extract text from PDFs

The function below extracts the text from a PDF file and returns it as a string. It is a quick way to search PDF files for matching text. It is a shortened version of the code at community.livejournal.com/php/295413.html by Jon Beckett (I think).

Copy the ExtractTextFromPdf function below and call it with either a file name or the raw PDF data. Note that it sometimes runs words together, because the spacing between them is set by other PDF parameters that the function ignores. It also only reads text from streams compressed with FlateDecode, which is the PDF name for zlib/deflate compression, so PDFs that store their text in some other way won't work.
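To make the FlateDecode part less mysterious, here is a small sketch of what such a stream looks like and how it is unpacked. It assumes PHP's zlib extension is loaded. The toy content and the variable names are made up for illustration, and a real PDF object carries more than this.

```php
<?php
// Build a toy stream object roughly the way a PDF writer would: zlib-compress
// the content and wrap it in a dictionary that names the FlateDecode filter.
$content = "BT /F1 12 Tf (Hello) Tj ET"; // made-up page content
$packed = gzcompress ($content); // zlib format, which is what FlateDecode expects
$object = "<< /Length " . strlen ($packed) . " /Filter /FlateDecode >>\nstream\n" . $packed . "\nendstream";

// Find the stream body between "stream" and "endstream", then undo the compression
preg_match ('/<<[^>]*FlateDecode[^>]*>>\s*stream\n(.+?)\nendstream/s', $object, $m);
$unpacked = gzuncompress ($m[1]);
echo $unpacked; // prints: BT /F1 12 Tf (Hello) Tj ET
```

The text to extract is the part in round brackets, which is why the function goes looking for () once a stream has been uncompressed.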

function ExtractTextFromPdf ($pdfdata) {
	//A short string that names an existing file is treated as a path; anything else is treated as raw PDF data
	if (strlen ($pdfdata) < 1000 && is_file ($pdfdata)) $pdfdata = file_get_contents ($pdfdata); //get the data from file
	if (!trim ($pdfdata)) {
		echo "Error: there is no PDF data or file to process.";
		return ''; //nothing to do, so stop here
	}
	$result = ''; //this will store the results
	//Find all the streams marked FlateDecode (zlib/deflate compression), and then loop through each of them
	if (!preg_match_all ('/<<[^>]*FlateDecode[^>]*>>\s*stream(.+)endstream/Uis', $pdfdata, $m)) {
		echo "Error: there is no FlateDecode text in this PDF file that I can process.";
		return $result;
	}
	foreach ($m[1] as $chunk) {
		$chunk = @gzuncompress (ltrim ($chunk)); //uncompress the data; gives false if it is not valid zlib data
		if ($chunk === false) continue; //skip streams that cannot be uncompressed
		$clean = @iconv ('UTF-8', 'ASCII//TRANSLIT', $chunk); //suggested in comments to code above to remove junk characters
		if ($clean !== false) $chunk = $clean; //keep the raw chunk if iconv rejects it
		//If there are [] in the data, then extract all stuff within (), or just extract () from the data directly
		$a = preg_match_all ('/\[([^\]]+)\]/', $chunk, $m2) ? $m2[1] : array ($chunk); //get all the stuff within []
		foreach ($a as $subchunk) if (preg_match_all ('/\(([^\)]+)\)/', $subchunk, $m3)) $result .= join ('', $m3[1]); //within ()
	}
	return $result; //return what was found
}
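
The words that run together mostly come from arrays like [(Hel) -20 (lo) -300 (World)]. The numbers between the strings shift the text position in thousandths of a text-space unit, and a large negative number usually stands in for a space. The sketch below is one possible way to put those spaces back. JoinTjArray and its $gap threshold of 200 are hypothetical names introduced here, not part of the original code, and the threshold is a guess that will not suit every PDF.

```php
<?php
// Hypothetical helper, not part of the original function: rebuilds the text of
// one [...] array and adds a space wherever the kerning number is below -$gap.
function JoinTjArray ($array, $gap = 200) {
	$text = '';
	// Each token is either a (string), which may contain backslash escapes, or a bare number
	preg_match_all ('/\(((?:\\\\.|[^\\\\()])*)\)|(-?\d*\.?\d+)/s', $array, $tokens, PREG_SET_ORDER);
	foreach ($tokens as $t) {
		if (isset ($t[2]) && $t[2] !== '') {
			// A number: treat a big negative adjustment as a word gap (assumed heuristic)
			if ((float) $t[2] < -$gap && $text !== '' && substr ($text, -1) !== ' ') $text .= ' ';
		}
		else $text .= preg_replace ('/\\\\([()\\\\])/', '$1', $t[1]); // a string: unescape \( \) and \\
	}
	return $text;
}

echo JoinTjArray ('[(Hel) -20 (lo) -300 (World)]'); // prints: Hello World
```

To try it, the join ('', $m3[1]) step in ExtractTextFromPdf could be swapped for a call to JoinTjArray ($subchunk) when the chunk contains [] arrays. It only understands literal (...) strings: octal escapes are kept as they are and hex strings are ignored.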