The AI4EPO team developed novel AI models for the European Patent Office CodeFest on Green Plastics

Responding to the European Patent Office's (EPO) first-ever CodeFest on Green Plastics, Helvia's CEO Dr. Stavros Vassos joined forces with Dr. Dimitrios Skraparlis from the EPO (NL) and Dr. Prodromos Malakasiotis and Odysseas Diamantopoulos from AUEB (GR), forming the AI4EPO team. The goal was to apply state-of-the-art AI to develop models that automate the classification of patents related to green plastics, tackling one of today's key sustainability challenges.

The challenges and the approach

The team first had to agree on a definition of 'green plastics'. As there is no standard definition, the team decided to rely on an EPO report and a green plastics cartography identified by experts.

The second challenge was that there were no labeled data identifying which patents belong to green plastics. To tackle this, the team generated lists of patents based on the report's cartography and labeled them with the respective categories.

The third challenge was that patents are long, and there is no standard method for extracting brief, relevant information. The approach here was to use automated summarization, as well as combinations of the full-text title, abstract, description, and claims.

The methodology

The methodology the team followed consists of the following six steps:

1. Define green plastics

2. Label patent examples with respect to green plastics categories

3. Preprocess patents to extract a “patent DNA” per example

4. Train state-of-the-art AI pipelines for text classification

5. Evaluate the results and select the winning approach

6. Refine the winning approach toward a practical MVP

1. Define green plastics

The team relied on the categorization (cartography) laid out by the experts of the study "Patents for tomorrow's plastics".

2. Label patent examples

To label the patent examples, the team curated a list of green plastics examples by executing queries on Google Patents Advanced Search for each 3rd-level entry of the cartography. Google Patents Advanced Search was chosen because it openly supports searching the full text of patents using Boolean syntax, proximity operators, wildcards, and classification markings. The queries combine CPC subclass allocations with keyword constructs carefully selected to correspond to primary search strategies with a similar or narrower search scope than the published queries used in the study "Patents for tomorrow's plastics".

Complexity constraints of Google Patents were worked around through careful construction of queries and query parts.

Following that, the team created lists of "near-miss" examples to be used as negative examples that do not belong to green plastics. The queries used combined CPC subclass allocations with targeted keyword negations.

The resulting dataset effectively builds upon samples of queries and CPC allocations generated and verified by human experts.
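The query-building step above can be sketched as follows. This is a minimal illustration of how CPC subclasses, keywords, and keyword negations might be combined into Boolean query strings; the CPC code and keywords shown are hypothetical examples, not the team's actual queries.

```python
# Sketch: assembling positive and "near-miss" (negative) Boolean query strings
# from a CPC subclass, keywords, and keyword negations. The CPC code and
# keywords below are illustrative placeholders.
def build_query(cpc: str, keywords: list[str], negations: list[str] = ()) -> str:
    """Combine a CPC subclass with keyword constructs in Boolean syntax."""
    kw = " OR ".join(f'"{k}"' for k in keywords)
    query = f"CPC={cpc} AND ({kw})"
    for n in negations:
        query += f' AND NOT "{n}"'
    return query

# Positive example: CPC subclass plus green-plastics keywords
positive = build_query("C08J11", ["plastic recycling", "depolymerization"])

# Near-miss negative: same CPC subclass, with targeted keyword negations
near_miss = build_query("C08J11", ["plastic"], negations=["recycling", "biodegradable"])
```

The real queries also used proximity operators and wildcards, which Google Patents Advanced Search supports.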

3. Preprocess patent examples - extracting a “patent DNA”

To preprocess the patent examples, the team followed these steps:

  • For each patent ID, extract the title, abstract, description, and claims as text, along with metadata, using the EPO OPS service
  • Translate* into English all parts not available in English
  • Summarize** the title & abstract into 75 words, the description into 180 words, and the claims into 150 words
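The per-patent flow above can be sketched as a small pipeline. This is a schematic only: the summarizer here is a plain word-count truncation standing in for the davinci-003 summarization the team used, and the part names are assumptions about how the OPS output might be structured.

```python
# Sketch of the per-patent preprocessing flow. The word budgets (75/180/150)
# follow the article; summarize() is a simple truncation standing in for the
# LLM-based summarization actually used.
WORD_BUDGETS = {"title_abstract": 75, "description": 180, "claims": 150}

def summarize(text: str, max_words: int) -> str:
    # Placeholder: truncate to a word budget instead of calling an LLM
    return " ".join(text.split()[:max_words])

def preprocess(parts: dict[str, str]) -> dict[str, str]:
    """Summarize each part of an (already English) patent into its word budget."""
    return {
        "title_abstract": summarize(parts["title"] + " " + parts["abstract"],
                                    WORD_BUDGETS["title_abstract"]),
        "description": summarize(parts["description"], WORD_BUDGETS["description"]),
        "claims": summarize(parts["claims"], WORD_BUDGETS["claims"]),
    }
```

In the actual pipeline, fetching would go through the EPO OPS service and translation through Google Translate before this summarization step.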

The result was a new, balanced dataset for green plastics classification totaling 4.3k patents: 2.2k positive and 2.1k negative examples.

The dataset includes three versions of the extracted "patent DNA" for each patent, small (400 words), medium (1,000 words), and large (1,500 words), built from summaries and full text, with a total size of 597 MB. The three sizes of "patent DNA" enable the application of AI language models of varying complexity.
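A simple sketch of how the three "patent DNA" sizes might be assembled from the summarized and full-text parts. The word budgets follow the article; the exact composition per size is an assumption for illustration.

```python
# Illustrative assembly of the three "patent DNA" sizes. The 400/1000/1500
# word budgets follow the article; combining summary and full text by simple
# concatenation and truncation is an assumption.
DNA_BUDGETS = {"small": 400, "medium": 1000, "large": 1500}

def truncate_words(text: str, limit: int) -> str:
    return " ".join(text.split()[:limit])

def patent_dna(summary: str, full_text: str, size: str) -> str:
    """Build a fixed-word-budget 'patent DNA' string for the given size."""
    return truncate_words(summary + " " + full_text, DNA_BUDGETS[size])
```

Smaller DNA fits models with short context windows; larger DNA preserves more detail for models that can use it.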

[* Google Translate was used for automated translation ]

[** OpenAI davinci-003 was used for automated summarization ]

4. Train state-of-the-art AI pipelines

To train the AI pipelines, the team harnessed the power of Large Language Models (LLMs) using the managed infrastructure and APIs of OpenAI and Cohere:

  • Zero-shot: no examples; only a definition of the task is given to the LLM
  • Few-shot (in-context) learning: 1 or 2 examples per class are given to the LLM; multiple trials are executed and the majority vote is taken
  • Fine-tuning: the dataset is used to fine-tune the LLM
  • Custom MLP neural network: the dataset is used to train a Multi-Layer Perceptron that takes LLM embeddings as its input layer

In addition, there were two pipelines per approach:

  • Binary: decide whether a patent is green plastics or not (yes/no)
  • Multi-label: decide which 2nd- and 3rd-level cartography classes a patent belongs to (pick a class, or NEG otherwise)
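The shape of the embedding-plus-MLP pipeline can be illustrated with scikit-learn. This is a minimal sketch, not the team's actual training code: the random vectors stand in for real text-embedding-ada-002 embeddings (1,536 dimensions), and the labels are synthetic.

```python
# Sketch of the "custom MLP on LLM embeddings" pipeline using scikit-learn.
# Random vectors stand in for text-embedding-ada-002 embeddings (1536 dims);
# labels are synthetic placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1536))         # stand-ins for ada-002 embeddings
y_binary = rng.integers(0, 2, size=200)  # 1 = green plastics, 0 = not

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=50, random_state=0)
clf.fit(X, y_binary)
pred = clf.predict(X[:5])

# The multi-label variant simply swaps the targets for cartography classes
# plus a NEG class, e.g. labels like "recycling", "bioplastics", ..., "NEG".
```

In production, X would be populated by calling the embeddings API on each "patent DNA" string.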

5. Evaluate the results and select the winning approach

The winning approach was E2 – MLP with ada-002 embeddings, which was trained on the dataset for multi-label classification.

Note that in the table above, for E2 and E3 we report "aggregate" results: the AI model was trained to select a 2nd- or 3rd-level category, but we only count whether it correctly determined, with high confidence, that the patent belongs to green plastics or not. In this way we get the binary decision ("Is it green plastics or not?") along with some hints about why the model classified it that way.
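The "aggregate" scoring described above can be sketched in a few lines: the model predicts a cartography class (or NEG), but only the implied binary decision is scored. The class names below are illustrative.

```python
# Sketch of "aggregate" evaluation: score only the binary decision implied by
# a multi-label prediction (any cartography class => green plastics; NEG => not).
def to_binary(label: str) -> bool:
    """True if the predicted class implies 'green plastics'."""
    return label != "NEG"

# Illustrative predictions and gold labels
preds = ["recycling", "NEG", "bioplastics", "NEG"]
gold  = ["bioplastics", "NEG", "bioplastics", "recycling"]

agg_correct = sum(to_binary(p) == to_binary(g) for p, g in zip(preds, gold))
accuracy = agg_correct / len(gold)
# First prediction picks the wrong class but the right binary decision,
# so it still counts as correct under aggregate scoring (accuracy = 0.75).
```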

The evaluation findings showed that:

  • Automated summaries are weak: crucial details for deciding green plastics seem to be missing. As a result, smaller models such as the BERT family, which accept up to ~350 words, are not expected to work well with automated summaries
  • Text generation LLMs do not perform well: perhaps significantly larger datasets are needed; fine-tuning with 4.3k examples did not yield decent performance
  • Text classification LLM embeddings are powerful: using a custom Multi-Layer Perceptron on top leads to near-perfect binary classification, i.e. deciding whether a patent belongs to green plastics
  • Multi-label classification performs similarly to binary: with multi-label classification we also get an explanation of a "yes" response in terms of the 2nd- or 3rd-level cartography classes

The table below shows some indicative results using E2, which also provides an explanation:


6. Refine the winning approach toward a practical MVP

The use of "patent DNA" of various sizes enables the exploration of cost-accuracy tradeoffs. The team investigated further modifications of the winning solution, including the use of the medium-sized patent DNA. On this basis, approach E5 was introduced, aiming at lower cost and latency in production due to a smaller input token count. Compared against the winner (E2), the results were the following:

  • E5 cost savings vs E2: ~30% smaller input token counts
  • E5 accuracy penalty vs E2: small on the binary decision (96.69% vs 98.8%), but significant on the 2nd-level decision, as the following reports show
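The cost side of this tradeoff is straightforward to estimate. The sketch below uses the reported ~30% input token reduction; the absolute token count and per-token price are illustrative assumptions, since embedding costs scale linearly with input tokens.

```python
# Back-of-the-envelope cost comparison under the reported ~30% input token
# reduction. The token count and per-1k-token price are assumed placeholders.
PRICE_PER_1K_TOKENS = 0.0001           # illustrative embedding price

e2_tokens = 1500                       # large "patent DNA" input (assumed)
e5_tokens = int(e2_tokens * 0.70)      # ~30% smaller input for E5

e2_cost = e2_tokens / 1000 * PRICE_PER_1K_TOKENS
e5_cost = e5_tokens / 1000 * PRICE_PER_1K_TOKENS
savings = 1 - e5_cost / e2_cost
print(f"{savings:.0%}")  # 30%
```

Since the per-token price cancels out, the relative savings track the token reduction regardless of the actual pricing.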

Conclusions

The solution contains a comprehensive analysis of traditional and modern, state-of-the-art models and approaches, utilizing all available published expert information (green plastics cartography, queries, CPC subclasses) to create a new dataset. All proposed and tested AI pipelines of AI4EPO are, by design, directly transferable to other base models and datasets.

The team proposes E2, which employs state-of-the-art LLM embeddings* combined with a custom Multi-Layer Perceptron neural network, producing excellent results on binary yes/no decisions, i.e. detecting whether a patent relates to green plastics or not. In addition, it offers information on why, by classifying patents into cartography entries of green plastics.

[* OpenAI model text-embedding-ada-002, published on 15/12/2022]

Next steps

The next steps include evaluating the results on ground-truth data with green plastics experts, employing a "human in the loop" approach to generate a premium dataset and drive continuous improvement, following the experience of a similar project (a challenge on large-scale biomedical semantic indexing and question answering, http://bioasq.org/).

Large-scale dataset generation may be further streamlined and optimized using powerful EPO internal tools. Additionally, further optimizing the input token size ("patent DNA") can improve the performance-accuracy tradeoff.

Lastly, generating a premium dataset may further facilitate the multi-class approach and reduce confusion between green plastics categories.