OCR for construction documents does not work, we fixed it (getanchorgrid.com)
67 points by wcisco17 4 hours ago
So we've built an API and trained models that detect fixtures, extract schedules, and analyze construction documents. Check us out!
More examples: - https://www.getanchorgrid.com/developer/docs/endpoints/drawi...
Main website: - https://www.getanchorgrid.com/developer
Why we did it: https://www.getanchorgrid.com/developer/docs/changelog/const...
petee 4 minutes ago
I ran the given door example and it missed 9 swinging doors: some in double-swing pairs, and a few standing on their own, not clustered. Not bad overall, though.
Terr_ 2 hours ago
> OCR for construction documents does not work
I'm reminded of the Xerox JBIG2 bug back in ~2013, where certain scan settings could silently replace numbers inside documents, and bad construction-plans were one of the cases that led to it being discovered. [0]
It wasn't overt OCR per se: end users weren't intending to convert pixels to characters or vice versa.
TehCorwiz 2 hours ago
If I recall it was an artifact of the compression algo.
Full context and details: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
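For illustration, here is a toy sketch of how a compressor that reuses "similar enough" glyph bitmaps can silently swap one digit for another. It is loosely in the spirit of pattern matching and substitution, not the actual JBIG2 codec or Xerox's implementation; the 4x5 bitmaps, the `max_diff` threshold, and all function names are made up for the example.

```python
# Toy model: each glyph is a tuple of bit-strings (4 wide, 5 tall).
# These bitmaps are hypothetical, chosen so that "6" and "8" differ by 2 pixels.
GLYPHS = {
    "6": ("0110", "1000", "1110", "1001", "0110"),
    "8": ("0110", "1001", "0110", "1001", "0110"),
    "1": ("0010", "0110", "0010", "0010", "0111"),
}
NAMES = {bitmap: char for char, bitmap in GLYPHS.items()}


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def compress(glyphs, max_diff):
    """Store each glyph once; reuse a stored glyph if it is within max_diff pixels."""
    dictionary, refs = [], []
    for glyph in glyphs:
        for index, known in enumerate(dictionary):
            if hamming(glyph, known) <= max_diff:
                refs.append(index)  # lossy step: a near match replaces the real glyph
                break
        else:
            dictionary.append(glyph)
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decompress(dictionary, refs):
    """Rebuild the page from the dictionary and the list of references."""
    return [dictionary[i] for i in refs]


def roundtrip(text, max_diff):
    """Render text to glyphs, compress, decompress, and read the result back."""
    dictionary, refs = compress([GLYPHS[c] for c in text], max_diff)
    return "".join(NAMES[g] for g in decompress(dictionary, refs))


print(roundtrip("861", max_diff=0))  # exact matching only
print(roundtrip("861", max_diff=2))  # loose matching: the 6 is stored as a copy of the 8
```

With exact matching the digits survive; with the looser threshold the output is a clean, sharp, wrong number, which is why this kind of error is so hard to spot by eye.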
sreekanth850 an hour ago
We’re taking a different path: building a parsing engine that converts CAD (DWG/DXF) into fully structured JSON with preserved semantics (no ML in the critical path). We also have a separate GIS parser that extracts vector data (features, layers, geometries) independently. I'd like to know how you handle consistency and reproducibility across runs when using models, and how you make it affordable, especially at scale, because as far as I know CAD and GIS need precision and accuracy.
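As a rough idea of what deterministic CAD-to-JSON parsing can look like, here is a minimal sketch that reads LINE entities from an ASCII DXF string. It is not the engine described above: the sample drawing, the field names in the output, and the helper names are assumptions for illustration, and a real DXF parser has to handle far more entity types and sections. It relies only on the ASCII DXF layout of alternating group-code and value lines.

```python
import json

# Hypothetical one-line drawing in ASCII DXF: group code on one line, value on the next.
SAMPLE_DXF = """0
SECTION
2
ENTITIES
0
LINE
8
WALLS
10
0.0
20
0.0
11
5.0
21
0.0
0
ENDSEC
0
EOF
"""

# Group codes for a LINE entity: 8 = layer, 10/20 = start x/y, 11/21 = end x/y.
LINE_FIELDS = {8: "layer", 10: "x1", 20: "y1", 11: "x2", 21: "y2"}


def parse_pairs(text):
    """Split the file into (group_code, value) pairs."""
    lines = [ln.strip() for ln in text.splitlines()]
    return [(int(lines[i]), lines[i + 1]) for i in range(0, len(lines) - 1, 2)]


def extract_lines(text):
    """Return every LINE entity in the ENTITIES section as a plain dict."""
    entities, current, in_entities = [], None, False
    for code, value in parse_pairs(text):
        if code == 2 and value == "ENTITIES":
            in_entities = True
        elif code == 0:
            # A group code of 0 starts a new entity or marker, closing the previous one.
            if current is not None:
                entities.append(current)
                current = None
            if value == "ENDSEC":
                in_entities = False
            elif in_entities and value == "LINE":
                current = {"type": "LINE"}
        elif current is not None and code in LINE_FIELDS:
            name = LINE_FIELDS[code]
            current[name] = value if name == "layer" else float(value)
    return entities


def to_json(entities):
    """Serialize with sorted keys so repeated runs give byte-identical output."""
    return json.dumps(entities, sort_keys=True)


print(to_json(extract_lines(SAMPLE_DXF)))
```

Because nothing here is probabilistic, the same file always yields the same JSON, which is the reproducibility property the question is about.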
oneneptune 11 minutes ago
Is this a service / product you plan to offer outwardly? I'd be interested in learning more. Use case: estimation.
i18nagentai 43 minutes ago
OCR accuracy on technical documents is one of those problems that looks 95% solved until you hit the edge cases. Construction docs are especially tricky because of mixed handwriting, stamps, revision clouds, and poor scan quality. Curious how you handle multi-language documents — a lot of international construction projects have specs in two or three languages on the same page.
frogguy 2 hours ago
Looks cool! Where are you getting the data to fine-tune the CV models for element extraction? I'm worried there isn't a robust enough dataset to build a detection model that will generalize to all of the slightly different standards each discipline (and each firm, for that matter) uses.
wcisco17 an hour ago
good q — we don't train on customer drawings. Our detection models are trained on a curated dataset of architectural drawings we've sourced and labeled ourselves, focused on the most common fixture and element types across CSI divisions.
The generalization problem you're pointing at is real and it's the hardest part of this. Our approach is to keep the detection scope tight — rather than trying to generalize across every firm's conventions, we train on a small but high-quality set of fixtures and optimize for precision within that scope.
The result is high confidence outputs on the elements we support, rather than mediocre coverage across everything.
We're expanding the detection surface incrementally as we validate accuracy division by division!
dylan604 39 minutes ago
How in the world is an answer to a question from the account posting TFA replying directly to said question getting killed?
testUser1228 2 hours ago
What do you foresee being the end use case for this (or most valuable use case)?
wcisco17 2 hours ago
Anyone building in or for construction tech — whether that's a startup building estimating or project management software, a construction company with an internal tech team solving this themselves, or a builder looking to automate their workflow. The common thread is drawings. Every one of those groups lives and dies by their ability to extract actionable data from a PDF that was never designed to be machine-readable. We're building the layer that makes that possible so they don't have to start from scratch.
wang_li 2 hours ago
Why does the workflow lie at the level of a real or virtual piece of paper and not in the metadata from the applications used to create that piece of paper? Seems like a CAD tool would allow you to identify each element of the drawing, assigning metadata as required.
Iulioh 3 hours ago
When will this be available for 30000x8000px electrical diagrams?
I have to make a BOM and oh boy I hate my job
oritron 3 hours ago
What software made the bitmap? Seems like a step earlier in the pipeline could help generate a BOM more easily.
Iulioh 2 hours ago
I'm not really sure and I don't have access to it; I just receive flat PDFs or TIFFs.
A lot of them are "archival" so I'm pretty OOL
alexeischiopu 2 hours ago
I’m building a similar platform, with electrical being furthest ahead - SLD, panels, lights, power, comms.
Also do doors, windows, and mechanical equipment.
dm, and I can include you in the next preview.
testUser1228 an hour ago
I'm not sure how to dm on here, but I'm very interested
Iulioh 2 hours ago
I work in the automotive field. I don't know if this complicates things further, but I appreciate any help!
jsidney 3 hours ago
What do you hate the most?
stronglikedan an hour ago
silly questions
hspraggins77 2 hours ago
Great points raised!
alexeischiopu 2 hours ago
Good idea :)
wcisco17 2 hours ago
Thanks!!
vessenes 2 hours ago
cool. What's pricing like?
wcisco17 2 hours ago
Thanks! https://www.getanchorgrid.com/developer/pricing
Let me know if you find it useful or have any questions, happy to help.
vessenes 2 hours ago
Thanks -- btw the Pricing link on the site pulls up a form, not that page.
achillesheels 3 hours ago
Love it! Starbucks Venti Macchiato sip
Love to give it to an arc client, not sure who the right person to implement this would be? Hmm…
wcisco17 2 hours ago
Hey OP here - Love to help if you're looking for a team to implement a solution.
https://cal.com/anchorgrid/anchorgrid-external-meeting?durat...
fithisux 3 hours ago
Of course it is not working. PDF and images are supposed to be tamper resistant. OCR tries to reverse engineer them.
kube-system 3 hours ago
Since when is tamper resistance a part of PDF or any common image format?
pwagland 3 hours ago
PDF files can be signed, that is tamper resistance. Tamper resistance doesn't have to make any difference to the readability of the document.
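A small sketch of that distinction: a signature makes changes detectable without making the content any harder to read. This uses an HMAC from Python's standard library as a stand-in; real PDF signatures use public-key cryptography rather than a shared secret, and the key, the sample text, and the function names here are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical shared secret, for illustration only


def sign(data: bytes, key: bytes) -> str:
    """Compute an authentication tag over the document bytes."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check that the document still matches the tag it was signed with."""
    return hmac.compare_digest(sign(data, key), tag)


original = b"Door schedule: D-101, 36 in x 84 in"
tag = sign(original, KEY)

# Anyone can still read the signed content; signing does not obscure it.
print(original.decode())

# Changing a single value is possible, but verification then fails.
tampered = original.replace(b"36", b"38")
print(verify(original, KEY, tag))
print(verify(tampered, KEY, tag))
```

The signed bytes remain plain, readable data; the tag only tells you whether they changed after signing.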
fithisux 2 hours ago
You can't change a PDF; by design it is not easy to OCR.
ware-intel 2 hours ago
Your smart features look like a game changer? Nice job!