However, I often run into the problem that these come as scanned documents (Discourses on Livy and Politics Among Nations, for example).
I would greatly benefit from turning these into text. I can use the Snipping Tool on pages, paste them into ChatGPT, and it comes out perfect. If I use classic OCR methods, they often screw up words. My final goal is to turn these into audiobooks (or even just to make it easier to copy-paste passages into my personal notes).
Given the state of AI, I'm wondering what my options are. I don't mind paying.
If you're dealing with public domain material, you can just upload to archive.org. They'll OCR the whole thing and make it available to you and everyone else. (If you got it from archive.org, check the sidebar for the existing OCR files.)
- Original book image: https://imgur.com/a8KxGpY
- OCR from archive.org: https://imgur.com/VUtjiON
- Output from Claude: https://imgur.com/keUyhjR
Note that if you're dealing with a work (or edition) that can't otherwise be found on archive.org, then once you upload it you are permitted, as the owner of that item, to open up the OCRed version and edit it. So an alternative workflow might be better stated:
1. upload to archive.org
2. check the OCR results
3. correct a local copy by hand or use a language model to assist if the OCR error rate is too high
4. overwrite the autogenerated OCR results with the copy from step 3 in order to share with others
(For those unaware and wanting to go the collaborative route, there is also the Wikipedia-adjacent WMF project called Wikisource. It has the upside of being more open (at least in theory) than, say, a GitHub repo, since PRs are not required for others to get their changes integrated. One might find it to be less open in practice, however, since it is inhabited by a fair few wikiassholes of the sort that folks will probably be familiar with from Wikipedia.)
The original transformer had dense attention, where every token attends to every other token, so the computational cost grew quadratically with context length. There are other attention patterns that can be used, though, such as only attending to recent tokens (sliding window attention), having a few global tokens that attend to all the others, attending to random tokens, or combinations of these (e.g. Google's "Big Bird" attention from their ELMo/BERT muppet era).
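(If it helps to visualize, here's a toy sketch of a sliding-window mask next to the full causal case; it's only an illustration of the pattern, not a claim about how any particular closed model implements it.)

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where attention is allowed: each token sees itself and the
    `window` tokens before it, instead of every earlier token."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j >= i - window)

# Full causal attention touches ~n^2/2 pairs; a sliding window only ~n*w.
print(sliding_window_mask(8, 2).astype(int))
```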
I don't know what types of attention the SOTA closed-source models are using, and they may well be using different techniques, but it wouldn't be surprising if there were "less attention" paid to tokens far back in the context. It's not obvious why this would affect a task like doing page-by-page OCR on a long PDF, though, since there it's only the most recent page that needs attending to.
Also, PDF file size isn't a relevant measure of token length, since PDFs can range from a collection of high-quality JPEG images to thousands of pages of text.
(albeit I believe o3-mini isn't natively multimodal)
(But e.g. I asked it about something Martin Short / John Mulaney said on SNL and it needed two prompts to get the correct answer. The first answer wasn't making anything up; it was just reasonably misinterpreting something.)
It also has web search, which will be more accurate if the pages it reads are good (it uses Bing search, so if possible provide your own links and forcibly enable web search).
Similarly, the latest Anthropic Claude Sonnet model (the new Sonnet 3.5 as of ~October) is very good.
The idea behind o3-mini is that it only knows as much as 4o-mini (the names suck, we know), but it is able to consider its initial response and edit it if it doesn't meet the original prompt's criteria.
I'd also tried ocrit, which uses Apple's Vision framework for OCR, with some success - https://github.com/insidegui/ocrit
It's an ongoing, iterative process. I'll watch this thread with interest.
Some recent threads that might be helpful:
* https://news.ycombinator.com/item?id=42443022 - Show HN: Adventures in OCR
* https://news.ycombinator.com/item?id=43045801 - Benchmarking vision-language models on OCR in dynamic video environments - driscoll42 posted some stats from research
* https://news.ycombinator.com/item?id=43043671 - OCR4all
(Meaning, I have these browser tabs open, I haven't fully digested them yet)
https://news.ycombinator.com/item?id=42952605 - Ingesting PDFs and why Gemini 2.0 changes everything
I can’t help but think a few amateur humans could have read the pdf with their eyes and written the markdown by hand if the OCR was a little sketchy.
I've done this with a few documents from the French and Spanish national archives, which were originally provided as enormous non-OCRed PDFs but shrank to 10% of the size (or less) after passage through archive.org, and incidentally became full-text-searchable.
Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.
If you want purely black and white output (e.g. if the PDF has yellowing pages and/or not-quite-black text, but doesn't have many illustrations), you can extract just the monochrome foreground layer from each page and ignore the color layers entirely.
First, extract the images: mutool extract in.pdf
Then delete the sRGB images.
Then combine the remaining images with ImageMagick: convert -negate *.png out.pdf
This gives you a clean black and white PDF without any of the color information or artifacts from the background layer.
Here's a script that does all that. It worked with two different PDFs from IA. I haven't tested it with other sources of MRC PDFs. The script depends on mutool and imagemagick.
https://gist.github.com/rahimnathwani/44236eaeeca10398942d2c...
All of these are OSS, and you don't need to pay a dime to anyone.
[0]: https://github.com/VikParuchuri/surya
[1]: https://github.com/JaidedAI/EasyOCR
[2]: https://github.com/PaddlePaddle/Paddle
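As a sense of how little code these need, here's a minimal EasyOCR [1] sketch (assuming `pip install easyocr` and a page image named page_001.png; Surya and PaddleOCR are similarly terse):

```python
import easyocr

# Load the English recognizer once (downloads model weights on first run)
reader = easyocr.Reader(['en'])

# detail=0 returns plain strings; paragraph=True merges boxes into blocks
lines = reader.readtext('page_001.png', detail=0, paragraph=True)
print('\n'.join(lines))
```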
I have some out-of-print books that I want to convert into nice PDFs/EPUBs (like, reference quality).
1) I don't mind destroying the binding to get the best quality. Any idea how I do so?
2) I have a multipage double-sided scanner (Fujitsu ScanSnap). Would this be sufficient to do the scan portion?
3) Is there anything that determines the font of the book text and reproduces it somehow? And that deals with things like bold and italic and applies them, either as Markdown output or what have you?
4) How do you de-paginate the raw text so it reflows into (say) an EPUB or PDF format that will paginate based on the output device's page size/layout?
2. I think it would be enough. People do great work with much less.
3. I think Surya would handle it. I have mostly done flat text. I would also try some LLM OCR models like Google Gemini 2.0 Flash, with different pipelines and different system prompts. I am yet to do this; it would be easy to check. About fonts: I never really worried about that myself. If it's something fancy, and you are crazy enough, you could create a font. Or you could use a handwriting-mimicry tool built on another AI model; I don't have a name off the top of my head. Look through OCR models. Indian college and high-school kids still have to submit handwritten projects and assignments. Some crafty kids use such tools to type (or ChatGPT copy-paste), then print in pen-ink color in their own handwriting and fool the teacher, given that there are a large number of assignments to check.
4. I am not sure I understand the question fully. Do you mean that the books' pages will have numbers, and they will be read as book text in your OCRed data? If so, I just used good old-fashioned regex to root the page numbers out (see the sketch below). Once you have the full text without page numbers, there are multiple tools to create EPUBs and PDFs. You can also reformat documents for the target device (assuming you already have an EPUB or PDF) using just Calibre.
1. I don't understand the question. Do you mean some other kind of scan than regular scanning? I don't know at all; I just work with regularly scanned documents.
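For point 4 above, the regex sketch I mentioned; it assumes page numbers sit alone on their own lines, which is how most OCR output emits them, so adapt the pattern to your books:

```python
import re

def strip_page_numbers(text: str) -> str:
    # Drop lines containing nothing but a page number (optionally wrapped
    # in dashes or brackets), then collapse the blank lines left behind.
    text = re.sub(r'(?m)^\s*[-\[\(]?\s*\d{1,4}\s*[-\]\)]?\s*$\n?', '', text)
    return re.sub(r'\n{3,}', '\n\n', text)

with open('ocr_output.txt') as f:
    cleaned = strip_page_numbers(f.read())
with open('book_text.txt', 'w') as f:
    f.write(cleaned)
```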
For instance, Discourses on Livy:
https://www.gutenberg.org/cache/epub/10827/pg10827-images.ht...
https://www.gutenberg.org/ebooks/10827
There is a lot of enthusiasm around language models for OCR, and I have found that they generally work well. However, I have had much better results, especially if there are tables etc., by sending the raw page image to the LLM along with the OCRed page and asking it to transcribe from the image while validating words/character sequences against the OCR.
This largely stops numbers and other details from being jumbled or hallucinated.
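A minimal sketch of that pattern using the OpenAI Python SDK; the model name, file names, and prompt wording are placeholders I've assumed, and the same idea works with Gemini or Claude:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def transcribe_page(image_path: str, ocr_text: str) -> str:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Transcribe the attached page image exactly, preserving paragraphs "
        "and tables. A rough OCR transcript is included below; use it to "
        "validate words, numbers and character sequences, but trust the "
        "image wherever the two disagree.\n\nOCR transcript:\n" + ocr_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```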
I recently tested llamaparse after trying it a year prior and was very impressed. You may be able to do your project on the free tier, and it will do a lot of this for you.
You can scan a book and listen (and also copy and paste the extracted text to other apps).
If you are looking to do this at large scale in your own UI, I would recommend either of Google's solutions:
1. Google Cloud Vision API (https://cloud.google.com/vision?hl=en)
2. Using Gemini API OCR capabilities. (Start here: https://aistudio.google.com/prompts/new_chat)
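For option 1, a minimal sketch with the Cloud Vision Python client (the file name is a placeholder; document_text_detection is the mode meant for dense text such as book pages):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

with open("page_001.png", "rb") as f:
    image = vision.Image(content=f.read())

# Dense-text OCR; returns the full transcript plus per-word geometry
response = client.document_text_detection(image=image)
if response.error.message:
    raise RuntimeError(response.error.message)

print(response.full_text_annotation.text)
```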
then
https://products.aspose.app/pdf/conversion/pdf-to-txt
It preserves the layout of the text using spaces.
We wrote the post and created Pulse [2] for these exact use cases; feel free to reach out for more info!
[1]: https://news.ycombinator.com/item?id=42966958
[2]: https://runpulse.com
It seems straightforward except for the canvas strip (I assume this is part of the binding?), and whether you add thicker pages/boards on each side as covers.
Do you have any photos of the process, or at least of a finished product? Thanks!
https://fixmydocuments.com/
I also made a simple iOS app that basically just uses the built in OCR functionality on iPhones and automatically applies it to all the pages of a PDF. It won’t preserve formatting, but it’s quite accurate in terms of OCR:
https://apps.apple.com/us/app/super-pdf-ocr/id6479674248
https://news.ycombinator.com/item?id=43043671
https://www.ocr4all.org/
I've begun thinking I dislike the rigid format of non-fiction books: a few ideas bulked out and iterated, with lots of context and examples. It takes me ages to get through, and I have very little free time. CliffsNotes are awful because you need some iteration and emotional connection to make the content stick.
I’d love a version of a book that has variable rates of summarization and is navigable around points and themes, so I can hop about while ensuring I don’t miss a key fact or insight buried somewhere
You can also pre-digest many such books in audio form (increasingly using what are now fairly powerful and tolerable text-to-speech tools), and dive in to read specific passages of note.
Because there's a formula to the book structure, you'll often find theory/overview / solutions presented in the introductory and concluding chapters or sections, with the mid-bulk section largely consisting of illustrations. There's an exceptionally ill-conceived notion that 1) one must finish all books one begins and 2) one must read all of a book. Neither of these are true, and I find books most usefully engaged as conversations with an author (books are conversations over time, telecoms are communications over space), and to read with a view to addressing specific goals: understanding of specific topics / problems / solutions, etc. This can cut your overall interaction with tedious works.
There's also, of course, a huge marketing dynamic to book publishing, on which one of the best treatments is Arthur Schopenhauer's "On Authorship" (trans. 1897):
... Writing for money and reservation of copyright are, at bottom, the ruin of literature. No one writes anything that is worth writing, unless he writes entirely for the sake of his subject. What an inestimable boon it would be, if in every branch of literature there were only a few books, but those excellent! This can never happen, as long as money is to be made by writing. It seems as though the money lay under a curse; for every author degenerates as soon as he begins to put pen to paper in any way for the sake of gain. The best works of the greatest men all come from the time when they had to write for nothing or for very little. And here, too, that Spanish proverb holds good, which declares that honor and money are not to be found in the same purse—honora y provecho no caben en un saco. The reason why Literature is in such a bad plight nowadays is simply and solely that people write books to make money. A man who is in want sits down and writes a book, and the public is stupid enough to buy it. The secondary effect of this is the ruin of language. ...
<https://en.wikisource.org/wiki/The_Art_of_Literature/On_Auth...>
It can correctly read many languages other than English, if that is something you need. Previously I tried other tools and there were many errors in conversion; this one does it well.
"Surprisingly, ChatGPT-4o gave the best Markdown output overall. Asking a multimodal LLM to simply convert a document to Markdown might be the best option if slow processing speed and token cost are not a problem."
https://ai.gopubby.com/benchmarking-pdf-to-markdown-document...
Anyone have success with prompting them to "just give me the text verbatim?"
More of a system: AWS Textract or Azure Document Intelligence. This option requires some coding and the cost is higher than using a vision model.
I haven’t used it yet, since my use case ended up not needing more than just sending to an LLM directly.
[1] https://blog.medusis.com/38_Adventures+in+OCR.html
For OP, there is a library written in Rust that can do exactly what you need, with very high accuracy and performance [1].
You would need the OCR dependencies to get it to work on scanned books [2].
[1] https://github.com/yobix-ai/extractous
[2] https://github.com/yobix-ai/extractous?tab=readme-ov-file#-s...
I especially like the approach to graalify Tika.
Just a thought.
https://github.com/maurycy/gemini-json-ocr
https://mathpix.com/pdf-conversion
That library is free for personal or open-source projects, but paid for commercial ones.
It would be easy to set up a pipeline like:
drop the PDF in an S3 bucket > EventBridge triggers a Step Function > the Step Function calls Textract > output saved to S3 & emailed to you
is the simplest thing that might work.
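If it helps, a rough sketch of the Textract step of that pipeline with boto3; the bucket and key are placeholders, and a real setup would react to the SNS/Step Functions completion event rather than polling:

```python
import time
import boto3

textract = boto3.client("textract")

# Start asynchronous text detection on a PDF that's already in S3
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-scans", "Name": "book.pdf"}}
)

# Poll until the job finishes (the Step Function would handle this for real)
while True:
    result = textract.get_document_text_detection(JobId=job["JobId"])
    if result["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(5)

# Only the first page of results; follow NextToken for longer documents
lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
print("\n".join(lines))
```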
It is free and mature.
Plain OCR is another thing that might work, and it's also simpler than an LLM.
Works quite well
You could script it using Gemini via the API[1].
Or use Tesseract[2].
[1]: https://ai.google.dev/
[2]: https://github.com/tesseract-ocr/tesseract
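If you go the Tesseract route, a minimal pytesseract sketch; it assumes the tesseract binary is installed and the PDF pages have already been exported as images (e.g. with pdftoppm):

```python
import glob
import pytesseract
from PIL import Image

# OCR every exported page image in order and join the results
text_parts = []
for path in sorted(glob.glob("pages/page-*.png")):
    text_parts.append(pytesseract.image_to_string(Image.open(path), lang="eng"))

with open("book.txt", "w") as f:
    f.write("\n".join(text_parts))
```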