efficient-longdoc-classification
https://github.com/amazon-science/efficient-longdoc-classification
📊 Stats
⭐ Stars: 48
📝 Language: Python
📝 Description: No description
🔬 Research Notes
README Excerpt
Source codes for "Efficient Classification of Long Documents Using Transformers"

Please refer to our paper for more details and cite our paper if you find this repo useful:
```
@inproceedings{park-etal-2022-efficient,
title = "Efficient Classification of Long Documents Using Transformers",
author = "Park, Hyunji and
Vyas, Yogarshi and
Shah, Kashif",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-short.79",
doi = "10.18653/v1/2022.acl-short.79",
pages = "702--709",
}
```
Instructions
1. Install required libraries
```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
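The spaCy model is installed separately from the pip requirements; a quick sanity check (not part of the repo) that the model resolves correctly:
```
# Hedged sanity check (not from this repo): confirm the spaCy English model loads.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A short sentence to confirm the pipeline runs.")
print([token.text for token in doc])
```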
2. Prepare the datasets
Hyperpartisan News Detection
* Available at
* Download the datasets
```
mkdir data/hyperpartisan
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/ground-truth-training-byarticle-20181122.zip
unzip data/hyperpartisan/articles-training-byarticle-20181122.zip -d data/hyperpartisan
unzip data/hyperpartisan/ground-truth-training-byarticle-20181122.zip -d data/hyperpartisan
rm data/hyperpartisan/*zip
```
* Prepare the datasets with the resulting xml files and this preprocessing script (following [Longformer](https://arxiv.org/abs/2004.05150)); a rough sketch of the XML layout is given below.
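As a hedged illustration only (not the repo's preprocessing script), the by-article release pairs an articles XML file with a ground-truth XML file keyed by article id. The element and attribute names below are assumptions about the SemEval-2019 Task 4 format, not code from this repo.
```
# Hedged sketch, not the repo's preprocessing script. Assumes the standard
# SemEval-2019 Task 4 "by-article" layout: <article id=...> bodies in one file,
# <article id=... hyperpartisan="true|false"> labels in the other.
import xml.etree.ElementTree as ET

articles_xml = "data/hyperpartisan/articles-training-byarticle-20181122.xml"
labels_xml = "data/hyperpartisan/ground-truth-training-byarticle-20181122.xml"

texts = {
    node.get("id"): " ".join(node.itertext()).strip()
    for node in ET.parse(articles_xml).getroot()
}
labels = {
    node.get("id"): node.get("hyperpartisan") == "true"
    for node in ET.parse(labels_xml).getroot()
}
pairs = [(texts[i], labels[i]) for i in texts if i in labels]
print(len(pairs))
```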
20NewsGroups
* Originally available at
* Running train.py with the --data 20news flag will download and prepare the data available via sklearn.datasets (following [CogLTX](https://proceedings.neurips.cc/paper/2020/file/96671501524948bc3937b4b30d0e57b9-Paper.pdf)); a minimal sklearn sketch follows below.
* We adopt the train/dev/test split from [this ToBERT paper](https://ieeexplore.ieee.org/document/9003958).
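For reference, the sklearn.datasets call that the --data 20news path relies on looks roughly like this (a minimal sketch, not the repo's loader):
```
# Minimal sketch (not the repo's loader): 20NewsGroups via sklearn.datasets,
# which is what the --data 20news path downloads according to the README.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))
print(train.target_names[:5])
```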
EURLEX-57K
* Available at
* Download the datasets
```
mkdir data/EURLEX57K
wget -O data/EURLEX57K/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip
unzip data/EURLEX57K/datasets.zip -d data/EURLEX57K
rm data/EURLEX57K/datasets.zip
rm -rf data/EURLEX57K/__MACOSX
mv data/EURLEX57K/dataset/* data/EURLEX57K
rm -rf data/EURLEX57K/dataset
wget -O data/EURLEX57K/EURLEX57K.json http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/eurovoc_en.json
```
* Running train.py with the --data eurlex flag reads and prepares the data from data/EURLEX57K/{train, dev, test}/*.json files.
* Running train.py with the --data eurlex --inverted flag creates Inverted EURLEX data by inverting the order of the sections.
* data/EURLEX57K/EURLEX57K.json contains label information. A rough loading sketch follows this list.
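As a hedged illustration of what inverting the order of the sections means, the sketch below reads one document and reverses its sections. The JSON field names (header, recitals, main_body, concepts) are assumptions based on the public EURLEX57K release and are not verified against this repo.
```
# Hedged sketch (not the repo's loader). Field names are assumptions based on the
# public EURLEX57K release: header/recitals as strings, main_body as a list of
# sections, concepts as EUROVOC label ids.
import glob
import json

path = sorted(glob.glob("data/EURLEX57K/train/*.json"))[0]
with open(path) as f:
    doc = json.load(f)

sections = [doc.get("header", ""), doc.get("recitals", "")] + list(doc.get("main_body", []))
inverted = sections[::-1]  # rough idea behind the --inverted variant
print(doc.get("concepts", [])[:5])  # label ids; label names live in data/EURLEX57K/EURLEX57K.json
```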
CMU Book Summary Dataset
* Available at
```
wget -P data/ http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
tar -xf data/booksummaries.tar.gz -C data
```
* Running train.py with the --data books flag reads and prepares the data from data/booksummaries/booksummaries.txt
* Running train.py with the --data books --pairs flag creates Paired Book Summary by combining pairs of summaries and their labels; a rough parsing sketch follows below.
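As a hedged sketch of the file the --data books path reads: booksummaries.txt is tab-separated, with genres stored as a small JSON mapping. The column order below follows the public CMU dataset description, not this repo's code.
```
# Hedged sketch (not the repo's loader). Column order follows the public CMU
# Book Summary description: Wikipedia id, Freebase id, title, author, pub date,
# genres (JSON mapping), summary. The genres field is empty for some rows.
import json

with open("data/booksummaries/booksummaries.txt", encoding="utf-8") as f:
    first = f.readline().rstrip("\n").split("\t")

wiki_id, freebase_id, title, author, pub_date, genres_json, summary = first
genres = list(json.loads(genres_json).values()) if genres_json else []
print(title, genres)
```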
3. Run the models
```
e.g. python train.py --model_name bertplusrandom --data books --pairs --batch_size 8 --epochs 20 --lr 3e-05
```
Note that we use the source code for the CogLTX model:
Hyperparameters used
Hyperpartisan
| Parameter | BERT | BERT+TextRank | BERT+Random | Longformer | ToBERT |
---
*Researched: 2026-03-27*