About the Project

Models for language-based applications such as speech recognition or NLP (natural language processing) must be trained on large corpora of linguistically analyzed documents in the target language. The larger and better analyzed the corpus, the more robust the linguistic model.

DepCC is a large linguistically analyzed corpus in English built from a web-scale crawl of the Common Crawl project. DepCC, which includes 365 million documents, is composed of 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences. A dependency parser and a named entity tagger were used to build a quickly searchable indexed corpus of all the sentences and their linguistic metadata. 

The key researchers leading the DepCC project are Eugen Ruppert, a senior research engineer at Hamburg University, and Alexander Panchenko, an assistant professor at Skolkovo Institute of Science and Technology. Having demonstrated that models trained on DepCC outperform models based on smaller corpora such as Wikipedia or SimVerb-3500, Ruppert is currently responsible for maintaining the health of the index, while Panchenko is focused on the development of applications related to the index, ranging from training syntax-based word embeddings to open information extraction and question answering.

About the ELK Stack

Today, the DepCC infrastructure comprises 16 servers, each of which serves both as a Hadoop cluster node for computation and as an Elasticsearch cluster node. The same server acts as the master node for both Hadoop and Elasticsearch; this master node, which is the publicly available endpoint server, also runs Kibana.

The project managers use Cerebro for administration and monitoring of the Elasticsearch cluster, which contains about 18 billion documents (15 billion of them in the DepCC index) and uses 14 TB of disk space, including replica shards.

The Elasticsearch Challenge

The cluster is located in the Hamburg University network and is accessible via the Internet to students and researchers with varying levels of Elasticsearch experience. However, Elasticsearch did not provide enough protection out of the box, and an unsecured Elasticsearch installation would be problematic even in a closed network. For this publicly available deployment, the project managers needed a reliable way to secure the installation against the risk of data manipulation or even destruction of the corpus, which took several months to create and would require a similar effort to recreate.

Furthermore, to enable access for as many researchers as possible, the project managers needed a user-friendly Kibana interface. Since their main requirements were user logins and access control, they explored the possibility of using X-Pack from Elastic. However, they found that the documentation regarding TLS setup was not straightforward and that the subscription was too expensive for a research group at a university.

The ReadonlyREST (ROR) Solution

The project leaders continued to look for a solution that would keep their Elasticsearch index secure while being accessed freely by researchers and students around the globe. They found their solution in an academic license for the ReadonlyREST (ROR) PRO Kibana plugin.

Leveraging the free ROR Elasticsearch plugin, which provides powerful and scalable access control, authentication, and authorization, the PRO plugin adds:

Customizable login screens.

Reliable one-click logouts.

Cluster-wide security settings.

Kibana interfaces customized per user or group.

The ROR plugins, which have proven track records and are well supported, ensure secure read-only access to the DepCC corpus while supporting the Kibana interface that is so easy to use, even for those with little or no ELK Stack experience.
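
To illustrate what authenticated, read-only querying might look like from a researcher's side, here is a minimal Python sketch. It assumes the cluster accepts HTTP Basic authentication and exposes the standard Elasticsearch `_search` endpoint; the host name, index name, user name, and password are hypothetical placeholders, not the actual DepCC settings. The function only builds the request and does not send it.

```python
import base64
import json
from urllib.request import Request


def build_search_request(base_url, index, query, user, password, size=10):
    """Build (but do not send) an authenticated Elasticsearch search request.

    All argument values used below are illustrative assumptions.
    """
    # A query_string query accepts free-text, search-engine-style input.
    body = json.dumps(
        {"query": {"query_string": {"query": query}}, "size": size}
    ).encode("utf-8")

    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    return Request(
        url=f"{base_url.rstrip('/')}/{index}/_search",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint, index, and credentials.
    request = build_search_request(
        "https://es.example.org:9200", "depcc", "named AND entity", "student", "secret"
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the query. Whether a given account may read a given index, and whether write requests are rejected, depends entirely on the access rules configured on the cluster.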

The ROR Benefits

The key benefits that ROR has brought to the DepCC project are:

A quick, easy, and well-documented installation process: The project team downloaded the respective plugin and installed it using just two commands. They then set up the YAML configuration file and restarted the Elasticsearch node, and they were done.

Seamless integration: No changes were required on their tool stack and compatibility was excellent with their third-party administration and monitoring tool (Cerebro).

Fine-grained configuration: The DepCC team can easily add new users, as well as add or modify access rights per individual index.

User-friendly interface: DepCC users can write queries in a familiar Google-like interface.

Zero maintenance: There have been no technical problems, and the only time the team needed to “work” on ROR was when they upgraded Elasticsearch.

Great support: Any questions about ROR were usually answered within minutes, and the ROR team always tried to provide a solution that would be optimal for the DepCC team’s needs.

Summary

The benefits described above were particularly valuable to a scientific project that has no dedicated IT team, letting team members concentrate on using the index for research without worrying about security or downtime. It also meant that external researchers and students could have unhindered yet secure access to the DepCC index, with access control and authentication now built into the Elasticsearch cluster.

In the words of the DepCC project leaders:

“If you have an Elasticsearch installation with data that cannot be re-indexed within a day or two, you need a security solution. Users do not need to be malicious to break something or delete data. Therefore, it is important to have a multi-user environment in ES. In our opinion, paying for a mature security solution like ROR results in a net gain.”

Learn more about the ROR plugins and how they can secure and customize your ELK Stack, with fixed annual subscriptions covering unlimited nodes. If you are an academic institution, feel free to contact us about the possibility of an academic license.
