infer_deepseek_ocr

About

1.1.0
MIT

DeepSeek-OCR document OCR to Markdown

Task: OCR
OCR
Markdown
DeepSeek
Vision-Language
Document

DeepSeek-OCR, developed by DeepSeek AI, uses a groundbreaking approach to compressing long contexts via optical 2D mapping. The system demonstrates that vision-based compression can handle text-heavy documents with remarkable efficiency, potentially changing how large language models (LLMs) process extensive textual information.

The DeepSeek-OCR system consists of two primary components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Together, they achieve an impressive 97% OCR precision at compression ratios below 10× (i.e., up to 10 text tokens represented by a single vision token).
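As a back-of-the-envelope illustration (not part of the Ikomia API), the compression ratio is simply the number of text tokens divided by the number of vision tokens that replace them:

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of original text tokens to vision tokens after optical compression."""
    return text_tokens / vision_tokens

# A page of ~1000 text tokens encoded as 100 vision tokens gives a 10x ratio,
# the regime in which DeepSeek-OCR reports ~97% OCR precision.
print(compression_ratio(1000, 100))  # 10.0
```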

(Benchmark results figure)

🚀 Use with Ikomia API

1. Install Ikomia API

We strongly recommend using a virtual environment. If you're not sure where to start, we offer a tutorial here.

pip install ikomia

2. Create your workflow

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_deepseek_ocr", auto_connect=True)

# Run on your image
wf.run_on(url="https://github.com/NanoNets/Nanonets-OCR2/blob/main/assets/bank_statement.jpg?raw=true")

# Display input
display(algo.get_input(0).get_image())
# Save output .json
deepseek_output = algo.get_output(1)
deepseek_output.save('deepseek_output.json')

☀️ Use with Ikomia Studio

Ikomia Studio offers a friendly UI with the same features as the API.

  • If you haven't started using Ikomia Studio yet, download and install it from this page.
  • For additional guidance on getting started with Ikomia Studio, check out this blog post.

📝 Set algorithm parameters

  • model_name (string, default: deepseek-ai/DeepSeek-OCR): Hugging Face model repo to load.
  • cuda (bool, default: auto): Use GPU if available. If set to True but no CUDA device is present, it will fall back to CPU.
  • prompt (string, default: "<|grounding|>Convert the document to markdown."): Text instruction appended after the image token to control the output style and task.
  • mode (enum, default: Gundam): Preset controlling resolution and cropping. One of: Tiny, Small, Base, Large, Gundam.
    • Gundam (recommended): balanced performance with crop mode (base_size=1024, image_size=640, crop_mode=True)
    • Base: standard quality without cropping (base_size=1024, image_size=1024, crop_mode=False)
    • Large: highest quality for complex documents (base_size=1280, image_size=1280, crop_mode=False)
    • Small: faster processing, good for simple text (base_size=640, image_size=640, crop_mode=False)
    • Tiny: fastest, suitable for clear printed text (base_size=512, image_size=512, crop_mode=False)
  • test_compress (bool, default: True): Enable internal compression/fast path to reduce compute and VRAM. Turn off for maximum fidelity.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_deepseek_ocr", auto_connect=True)

algo.set_parameters({
    'prompt': "<|grounding|>Convert the document to markdown.",
    'mode': "Gundam",
    'test_compress': "True",
})

# Run on your image
wf.run_on(url="https://github.com/NanoNets/Nanonets-OCR2/blob/main/assets/bank_statement.jpg?raw=true")

# Show input
display(algo.get_input(0).get_image())
# Save output .json
deepseek_output = algo.get_output(1)
deepseek_output.save('deepseek_output.json')
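For reference, the mode presets listed above map to the following settings. The dict below is an illustrative sketch built from the parameter descriptions, not an Ikomia API object:

```python
# Preset name -> (base_size, image_size, crop_mode), per the parameter docs above
MODE_PRESETS = {
    "Tiny":   (512, 512, False),
    "Small":  (640, 640, False),
    "Base":   (1024, 1024, False),
    "Large":  (1280, 1280, False),
    "Gundam": (1024, 640, True),   # crop mode enabled
}

base_size, image_size, crop_mode = MODE_PRESETS["Gundam"]
print(base_size, image_size, crop_mode)  # 1024 640 True
```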

🔍 Explore algorithm outputs

Every algorithm produces specific outputs, yet they can all be explored the same way using the Ikomia API. For a more in-depth understanding of managing algorithm outputs, please refer to the documentation.

from ikomia.dataprocess.workflow import Workflow

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_deepseek_ocr", auto_connect=True)

# Run on your image
wf.run_on(url="https://github.com/NanoNets/Nanonets-OCR2/blob/main/assets/bank_statement.jpg?raw=true")

# Iterate over outputs
for output in algo.get_outputs():
    # Print information
    print(output)
    # Export it to JSON
    output.to_json()
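If you save the second output to JSON as in the first example, you can reload it with the standard library. The exact schema depends on the algorithm's output type, so the field names below are hypothetical placeholders:

```python
import json

# Hypothetical structure: the real keys depend on the algorithm's output type.
sample = {"detections": [{"text": "Account balance", "box": [10, 20, 200, 40]}]}

with open("deepseek_output.json", "w") as f:
    json.dump(sample, f)

with open("deepseek_output.json") as f:
    data = json.load(f)

for det in data.get("detections", []):
    print(det["text"], det["box"])
```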

Advanced usage

💡 Tips for Best Results

  • For receipts: use the "Gundam" or "Base" preset with the OCR prompt
  • For documents with tables: use the "Large" preset with the markdown prompt
  • If text is not detected: try presets in this order: Gundam → Base → Large
  • For handwritten text: use the "Large" preset for better accuracy
  • Ensure images are clear and well-lit for optimal results

📝 Prompts examples

  • document: <|grounding|>Convert the document to markdown.
  • other image: <|grounding|>OCR this image.
  • without layouts: Free OCR.
  • figures in document: Parse the figure.
  • general: Describe this image in detail.
  • rec: Locate <|ref|>xxxx<|/ref|> in the image.
  • example text to locate: '先天下之忧而忧' ("Worry before the world worries")
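A tiny helper to pick one of the prompts listed above per task. The task labels are hypothetical; only the prompt strings come from the list:

```python
# Map task names (hypothetical labels) to the prompt strings listed above.
PROMPTS = {
    "document": "<|grounding|>Convert the document to markdown.",
    "image": "<|grounding|>OCR this image.",
    "free": "Free OCR.",
    "figure": "Parse the figure.",
    "describe": "Describe this image in detail.",
}

def prompt_for(task: str) -> str:
    """Return the prompt for a task, defaulting to the document prompt."""
    return PROMPTS.get(task, PROMPTS["document"])

print(prompt_for("figure"))  # Parse the figure.
```

The chosen string can then be passed straight to `algo.set_parameters({'prompt': ...})`.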

Developer

  • Ikomia

License

A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Permissions
  • Commercial use
  • Modification
  • Distribution
  • Private use

Conditions
  • License and copyright notice

Limitations
  • Liability
  • Warranty

This is not legal advice: this description is for informational purposes only and does not constitute the license itself. Provided by choosealicense.com.