CoSyn: The open-source tool that’s making GPT-4V-level vision AI accessible to everyone
Researchers at the University of Pennsylvania and the Allen Institute for Artificial Intelligence have released a tool that could narrow the gap between open-source AI systems and proprietary models like GPT-4V and Gemini 1.5 Flash. The tool, known as CoSyn (Code-Guided Synthesis), tackles a central bottleneck in AI development: the scarcity of high-quality training data needed to teach machines to understand complex visual information.
Unlike traditional methods that rely on scraping images from the internet, which raises copyright and ethical concerns, CoSyn leverages the coding capabilities of existing language models to generate synthetic training data. The work is a collaboration with the PRIOR team at the Allen Institute for AI, supported by the Office of the Director of National Intelligence, the Intelligence Advanced Research Projects Activity, and the Defense Advanced Research Projects Agency.
The key innovation of CoSyn rests on a simple observation: text-rich images such as charts, tables, and documents are typically created by code in the first place. CoSyn prompts language models to write that underlying code, then executes it to render realistic synthetic images, an approach that has produced strong results on text-rich image understanding benchmarks.
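The pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the CoSyn codebase: the hard-coded `generated_code` string stands in for what a language model would actually write, and the rendering target (SVG) and the `render`/`make_instruction_pair` helpers are invented for the example. The key idea it demonstrates is that because the image comes from code, the ground-truth answer for an instruction pair can be read directly from that code rather than guessed from pixels.

```python
# Stand-in for LLM output: a small program that renders a bar chart as SVG.
generated_code = '''
data = {"Q1": 120, "Q2": 95, "Q3": 140}
bars = []
for i, (label, value) in enumerate(data.items()):
    x = 40 + i * 60
    bars.append(f'<rect x="{x}" y="{200 - value}" width="40" height="{value}" fill="steelblue"/>')
    bars.append(f'<text x="{x + 20}" y="215" text-anchor="middle">{label}</text>')
svg = '<svg xmlns="http://www.w3.org/2000/svg" width="260" height="220">' + "".join(bars) + "</svg>"
'''

def render(code: str) -> str:
    """Execute generated rendering code and return the image source."""
    namespace = {}
    exec(code, namespace)  # a real pipeline would sandbox this step
    return namespace["svg"]

def make_instruction_pair(code: str) -> dict:
    """Derive a QA pair from the code itself, where the ground truth is explicit."""
    namespace = {}
    exec(code, namespace)
    data = namespace["data"]
    top = max(data, key=data.get)
    return {"question": "Which quarter had the highest value?", "answer": top}

image_svg = render(generated_code)
qa = make_instruction_pair(generated_code)
print(qa["answer"])  # → Q3
```

Pairing the rendered image with the code-derived answer yields one synthetic training example; varying the generated program varies both at once.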
Trained on a synthetic dataset of 400,000 images and 2.7 million instruction pairs, models built with CoSyn have outperformed proprietary models on seven benchmark tests. Even a "zero-shot" model, trained without any examples from the evaluation datasets, surpassed most open and closed models, showing that capabilities learned from synthetic data transfer to real tasks.
One of the key strengths of CoSyn is its persona-driven approach, which ensures data diversity by pairing each generated example with a randomly sampled persona. This approach has enabled the system to generate content across nine different categories, including charts, documents, math problems, tables, diagrams, and more.
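The persona-driven sampling described above can be illustrated with a short sketch. This is a hypothetical mock-up, not CoSyn's actual prompt templates: the persona strings, category list, and `build_prompt` helper are invented for the example. It shows the mechanism: randomly pairing a persona with a content category makes repeated generation requests diverge instead of collapsing onto the same few outputs.

```python
import random

# Hypothetical personas and categories; CoSyn's real pools are far larger
# and cover nine content categories.
PERSONAS = [
    "a nurse tracking patient vitals",
    "a climate scientist comparing regional emissions",
    "a small-business owner reviewing quarterly sales",
]
CATEGORIES = ["chart", "document", "table", "diagram", "math problem"]

def build_prompt(rng: random.Random) -> str:
    """Pair a randomly sampled persona with a category to diversify requests."""
    persona = rng.choice(PERSONAS)
    category = rng.choice(CATEGORIES)
    return f"Write code that renders a {category} for {persona}."

rng = random.Random(0)           # seeded for reproducibility
prompts = {build_prompt(rng) for _ in range(10)}
for p in sorted(prompts):
    print(p)
```

Each distinct prompt steers the language model toward different content, so the resulting images and instruction pairs cover a broader slice of what real users produce.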
The implications of CoSyn extend beyond improved benchmark scores. The approach lends itself to real-world applications across industries, from automated document processing in financial services to quality control in manufacturing. By letting companies develop AI systems tailored to their specific needs without massive data collection efforts, CoSyn lowers the barrier to building specialized vision-language models.
The commitment to openness is a core principle of CoSyn, with the complete codebase, dataset, and training scripts being made publicly available. This transparency not only addresses concerns about the black-box nature of proprietary AI systems but also allows researchers and companies worldwide to build upon the work.
As the research moves from academic laboratories to real-world applications, the potential applications of CoSyn are vast. From AI that understands sign language for the hearing impaired to systems that describe complex medical images for those with visual impairments, the technology has the power to transform how people interact with technology.
CoSyn represents a significant advance in AI development, demonstrating that open-source systems can compete with proprietary models through innovative approaches to fundamental challenges. In the pursuit of AI that can truly see and understand the world, creativity may matter as much as scale.