Image Import¶

Before importing image and text data, the corpus must be preprocessed. Currently, only Word (Docx) and Excel (xlsx) files are supported for image and text processing.

Preprocessing Docx Documents¶

Docx documents containing images and text can be split automatically according to a specified character length.
Manual splitting is also supported: use <split></split> tags to plan the document's paragraph divisions in advance.
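The two splitting modes can be illustrated with a minimal sketch. The function name `split_text`, the `max_chars` default, and the rule that manual tags take precedence over length-based splitting are assumptions made for illustration; the actual preprocessing tool may behave differently.

```python
import re

# Matches the content between <split> and </split>, across line breaks.
SPLIT_TAG = re.compile(r"<split>(.*?)</split>", re.DOTALL)


def split_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text by <split> tags if any are present, else by character length.

    Hypothetical helper: the precedence of tags over length is an assumption.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Manual mode: each tagged span becomes one chunk (empty spans are dropped).
    manual = [chunk.strip() for chunk in SPLIT_TAG.findall(text)]
    manual = [chunk for chunk in manual if chunk]
    if manual:
        return manual
    # Automatic mode: fixed-size windows of at most max_chars characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Planning the divisions manually is useful when a fixed character length would cut a paragraph or an image caption in half.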

Paste images directly into the Docx document. Do not wrap them in shapes or text boxes; otherwise the program may fail to detect them and skip image processing.
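Before uploading, a document can be checked for images that sit inside text boxes. The sketch below is a rough heuristic, not part of the preprocessing tool: it relies on a Docx file being a zip archive whose body is stored in `word/document.xml`, and it looks only for DrawingML `a:blip` image references nested in `w:txbxContent` elements. The function name `count_images` is hypothetical, and counts may be inflated when Word stores a compatibility copy of the same text box.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for WordprocessingML and DrawingML.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def count_images(docx):
    """Return (total_image_refs, image_refs_inside_text_boxes).

    `docx` may be a file path or a binary file-like object.
    Heuristic only: legacy VML images and shape fills are not inspected.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Every picture in the body references its image data through a:blip.
    total = len(root.findall(f".//{{{A}}}blip"))
    # Image references that appear anywhere below a text box's content.
    wrapped = sum(
        len(box.findall(f".//{{{A}}}blip"))
        for box in root.iter(f"{{{W}}}txbxContent")
    )
    return total, wrapped
```

A nonzero second value suggests that at least one image is wrapped in a text box and should be pasted directly into the document instead.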

Preprocessing xlsx Documents¶

The xlsx file must conform to a fixed template format:

Q: Question, A: Answer.

Organize xlsx documents according to the template, and place each illustration within a single cell so that it does not span multiple cells.

Generating Image and Text Corpus¶

Log in to the environment: https://console.d.run/ai-tools/lab? Password: aitools.
Upload the corpus file. Navigate to the directory /app/corpus_processing/input and upload the corpus file to this directory.
Click to run the code.
Download the generated image and text corpus file. Go to the directory /app/corpus_processing/output to download the zip file.
Clean up the environment. Clear the input and output files, as well as the running log files.

Note

This environment is public. After processing private corpus files, perform the cleanup operation so that your files do not remain accessible to other users.
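The cleanup step can be sketched in Python as follows. The input and output paths are the ones given in the steps above; the helper name `clear_directory` is hypothetical, and the location of the running log files is not stated here, so it is left out of the sketch.

```python
import shutil
from pathlib import Path

# Directories named in the steps above.
INPUT_DIR = Path("/app/corpus_processing/input")
OUTPUT_DIR = Path("/app/corpus_processing/output")


def clear_directory(directory: Path) -> int:
    """Delete everything inside `directory`, keeping the directory itself.

    Returns the number of top-level entries removed (0 if the directory
    does not exist).
    """
    removed = 0
    if not directory.is_dir():
        return removed
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

Calling `clear_directory(INPUT_DIR)` and `clear_directory(OUTPUT_DIR)` after downloading the zip file empties both directories; log files would need the same treatment once their path is known.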

Importing the Downloaded Files¶

Click Corpus Import -> Image and Text Import.
Upload the processed file, start vectorization, and wait until processing completes successfully.