diff --git a/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb b/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb
index 0c75a78a5633012e5ddfec34db192db9379225e2..6f128be3e810882c6641626bccbbcbff4a210067 100644
--- a/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb
+++ b/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb
@@ -42,6 +42,16 @@
     "#!pip install rich ipywidgets"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7b23d509",
+   "metadata": {},
+   "source": [
+    "Assuming you have a PDF available on the same machine, please set the path to the file. \n",
+    "\n",
+    "Also, if you want to flex your GPU, feel free to switch to a bigger model, although the featherlight models work perfectly well for this task:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 14,
@@ -60,7 +70,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Import necessary libraries\n",
     "import PyPDF2\n",
     "from typing import Optional\n",
     "import os\n",
@@ -74,6 +83,14 @@
     "warnings.filterwarnings('ignore')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "203c22eb",
+   "metadata": {},
+   "source": [
+    "Let's make sure we don't stub our toe later by first checking that the file exists:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -91,6 +108,16 @@
     "    return True"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5a362ac3",
+   "metadata": {},
+   "source": [
+    "Convert the PDF to a `.txt` file. This simply reads and dumps the contents of the file, capped at 100k characters. \n",
+    "\n",
+    "If you are converting your favorite novel into a podcast, you will need to add extra logic to handle text that exceeds the Llama models' context length of 128k tokens."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -144,6 +171,14 @@
     "    return None\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e023397b",
+   "metadata": {},
+   "source": [
+    "Here is a helper function to grab metadata about our PDF:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
@@ -169,6 +204,14 @@
     "    return None"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6019affc",
+   "metadata": {},
+   "source": [
+    "Finally, we can run our logic to extract the details from the file:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 12,
@@ -268,6 +311,22 @@
     "    print(f\"\\nExtracted text has been saved to {output_file}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "946d1f59",
+   "metadata": {},
+   "source": [
+    "### Llama Pre-Processing\n",
+    "\n",
+    "Now let's use our distaste for writing regex as a justification for an LLM instead:\n",
+    "\n",
+    "At this point, we have a text file extracted from a PDF of a paper. PDF extracts are generally messy due to stray characters, formatting, LaTeX, tables, etc. \n",
+    "\n",
+    "One way to handle this would be regex; instead, we can also prompt the featherlight Llama models to clean up the text for us. \n",
+    "\n",
+    "Please try changing the `SYS_PROMPT` below to see what improvements you can make:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 60,
@@ -297,6 +356,16 @@
     "\"\"\""
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "fd393fae",
+   "metadata": {},
+   "source": [
+    "Instead of having the model process the entire file at once, as you noticed in the prompt, we will pass chunks of the file. \n",
+    "\n",
+    "One issue with counting chunks by characters is that we can split words in half and lose their meaning, so instead we chunk by words:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 61,
@@ -331,6 +400,14 @@
     "    return chunks"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5d74223f",
+   "metadata": {},
+   "source": [
+    "Let's load the model and start processing the text chunks:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 62,
@@ -2550,6 +2627,14 @@
     "        out_file.flush()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "31cffe8d",
+   "metadata": {},
+   "source": [
+    "Let's print out the final processed version to make sure everything looks good:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 68,
@@ -2617,7 +2702,9 @@
    "id": "1b16ae0e-04cf-4eb9-a369-dee1728b89ce",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "#fin"
+   ]
   }
  ],
  "metadata": {